Plan of the lecture

Introduction: Information Extraction and Named Entity Recognition (NER)
NER: definitions and tasks (extraction, classification, disambiguation)
Basic programming concepts in Python
Doing NER with existing libraries:
- NER from Latin texts with CLTK
- NER from journal articles with NLTK

Python: basic concepts

Python is a very flexible and powerful programming language that can help you work with texts and corpora. Python's philosophy emphasizes code readability, and the language features a simple and very expressive syntax. It is actually easy to master the basic aspects of Python's syntax: it is amazing how much you can do even with just the most basic concepts... The aim of these two lectures is to introduce you to some of these basic operations, let you see some code in action and also give you some exercises where you can apply what you've seen.

It is also amazing how many things you can accomplish with a few well-written lines of Python! By the end of this class, we'd like to show you how to use Python to perform (some) Natural Language Processing. But of course, you can also just use Python to do something as easy as...



In [1]:

    
2 + 3









    Out[1]:





5

Variables and data types

Here we go! We've written our first line of code... But I guess we want to do something a little more interesting, right? Well, for a start, we might want to use Python to execute some operation (say: sum two numbers like 2 and 3), store the result, print it on the screen, process it, and reuse it as many times as we want...

Variables are what we use to store values. Think of a variable as a shoebox where you place your content; next time you need that content (i.e. the result of a previous operation, or for example some input you've read from a file) you simply call the shoebox by its name...



In [3]:

    
result = 2 - 2



In [4]:

    
#now we print the result
print(result)



In [5]:

    
# by the way, I'm a comment. I'm not executed
# every line of code following the sign # is ignored:
# print("I'm line n. 3: do you see me?")
# see? You don't see me...
print("I'm line nr. 5 and you DO see me!")









    



I'm line nr. 5 and you DO see me!

That's it! As easy as that (yes, in some programming languages you have to create or declare the variable first and then use it to fill the shoebox; in Python, you go ahead and simply use it!)

Now, what do you think we will get when we execute the following code?



In [7]:

    
result + 8









    Out[7]:





8
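As a small sketch of the shoebox idea (the names total and doubled are made up for this illustration), a value can be stored once, reused in later operations, and even replaced by a new value:

```python
# Store the result of an operation in a variable (the "shoebox").
total = 2 + 3
print(total)

# Reuse the stored value in another operation.
doubled = total * 2
print(doubled)

# Assigning again replaces the old content of the shoebox.
total = total + 1
print(total)
```

Note that doubled keeps its value even after total is changed: it stored the result of the multiplication, not a link to total.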

What types of values can we put into a variable? What goes into the shoebox? We can start with the members of this list:

Integers (-1,0,1,2,3,4...)
Strings ("Hello", "s", "Wolfgang Amadeus Mozart", "I am the α and the ω!"...)
Floats (3.14159; 2.71828...)
Booleans (True, False)

If you're not sure what type of value you're dealing with, you can use the function type(). Yes, it works with variables too...!



In [6]:

    
type("I am the α and the ω!")









    Out[6]:





str



In [7]:

    
type(2.7182818284590452353602874713527)









    Out[7]:





float



In [9]:

    
type(True)









    Out[9]:





bool



In [15]:

    
result = "hello"



In [16]:

    
type(result)









    Out[16]:





str
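Here is a minimal sketch of type() applied to variables rather than to literal values. The variable names are invented for the example:

```python
number = 42
ratio = 3.14159
name = "Wolfgang Amadeus Mozart"
is_composer = True

print(type(number))
print(type(ratio))
print(type(name))
print(type(is_composer))

# The type belongs to the value, not to the name:
# the same variable can later hold a value of a different type.
number = "forty-two"
print(type(number))
```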

You declare strings with single ('') or double ("") quotes: it makes no difference! But now two questions:

what happens if you forget the quotes?
what happens if you put quotes around a number?



In [10]:

    
mionome = "lucia"
print(mionome)









    



lucia



In [11]:

    
print("ciao")









    



ciao



In [13]:

    
type("4")









    Out[13]:





str
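The two questions above can be explored with a short sketch like the following. It uses a try/except block (explained in the section on errors and exceptions) only so that the example keeps running after the error; the name ciao is deliberately left undefined.

```python
# Quotes around a number make it a string, not a number.
quoted = "4"
unquoted = 4
print(type(quoted))
print(type(unquoted))

# Without quotes, Python reads a word as a variable name.
# If no variable with that name exists, a NameError is raised.
try:
    greeting = ciao
except NameError as error:
    name_error_message = str(error)
    print("NameError:", name_error_message)
```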

String, integer, float... Why is that so important? Well, try to sum two strings and see what happens...



In [22]:

    
"2" + "3"









    Out[22]:





'23'



In [23]:

    
#probably you wanted this...
int("2") + int("3")









    Out[23]:





5

But if we are working with strings, then the "+" sign is used to concatenate the strings:



In [24]:

    
a = "interesting!"
print("not very " + a)









    



not very interesting!
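One consequence worth sketching: "+" will not mix strings and numbers, so a number has to be converted with str() before it can be concatenated. The variables count and sentence are just illustrative names:

```python
count = 4

# "There are " + count would raise a TypeError,
# because a str and an int cannot be concatenated.
sentence = "There are " + str(count) + " Beatles"
print(sentence)

# Converting the other way: text that looks like a number becomes a number.
total = int("2") + int("3")
half = float("2.5") + 1
print(total)
print(half)
```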

Lists and dictionaries

Lists and dictionaries are two very useful types to store whole collections of data



In [14]:

    
beatles = ["John", "Paul", "George", "Ringo"]
type(beatles)









    Out[14]:





list



In [15]:

    
# dictionaries are collections of key : value pairs
beatles_dictionary = { "john" : "John Lennon" ,
                      "paul" : "Paul McCartney",
                      "george" : "George Harrison",
                      "ringo" : "Ringo Starr"}
type(beatles_dictionary)









    Out[15]:





dict

(there are also other types of collection, like Tuples and Sets, but we won't talk about them now; read the links if you're interested!)

Items in a list are accessible using their index. Do remember that indexing starts from 0!



In [20]:

    
print(beatles[3], beatles[1])




Ringo Paul



In [28]:

    
#indexes can be negative!
beatles[-1]









    Out[28]:





'Ringo'

Dictionaries are collections of key : value pairs. You access a value using its key as the index:



In [21]:

    
beatles_dictionary["paul"]









    Out[21]:





'Paul McCartney'



In [22]:

    
beatles_dictionary[0]









    



---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-22-31e6fcd3d0e7> in <module>()
----> 1 beatles_dictionary[0]

KeyError: 0
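The KeyError above appears because 0 is not one of the keys: dictionaries are looked up by key, not by position. A possible way to avoid the error, sketched here with the same dictionary, is to check with "in" or to use the get() method with a fallback value:

```python
beatles_dictionary = {
    "john": "John Lennon",
    "paul": "Paul McCartney",
    "george": "George Harrison",
    "ringo": "Ringo Starr",
}

# "in" checks whether a key exists before we use it.
has_paul = "paul" in beatles_dictionary
has_zero = 0 in beatles_dictionary

# get() returns a fallback value instead of raising a KeyError.
missing = beatles_dictionary.get("pete", "unknown")

# Assigning to a new key adds a new key : value pair.
beatles_dictionary["pete"] = "Pete Best"

print(has_paul, has_zero, missing, len(beatles_dictionary))
```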

There are a bunch of methods that you can apply to lists to work with them.

You can append items at the end of a list



In [23]:

    
beatles.append("randomname")
beatles









    Out[23]:





['John', 'Paul', 'George', 'Ringo', 'randomname']

You can learn the index of an item



In [24]:

    
beatles.index("Paul")









    Out[24]:





1

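You can insert elements at a given index. Note that in the session recorded below the insert cells had apparently been run several times (and "Liam" had been inserted too), which is why the list contains duplicates: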



In [33]:

    
beatles.insert(0, "Pete Best")
print(beatles.index("George"))
beatles









    



7






    Out[33]:





['Pete Best',
 'Pete Best',
 'John',
 'Liam',
 'Liam',
 'Liam',
 'Paul',
 'George',
 'Ringo',
 'randomname']
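For comparison, here is a sketch of the same insert() call on a freshly created list, assuming the cell is run exactly once:

```python
beatles = ["John", "Paul", "George", "Ringo"]

# Insert a new item at index 0: every other item shifts one place to the right.
beatles.insert(0, "Pete Best")

george_index = beatles.index("George")
print(george_index)
print(beatles)
```

George was at index 2 before the insert and is at index 3 afterwards.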

But most importantly, you can slice lists, producing sub-lists by specifying the range of indexes you want:



In [34]:

    
beatles[1:5]









    Out[34]:





['Pete Best', 'John', 'Liam', 'Liam']

Do you notice something strange? Yes, the limit index is not inclusive (i.e. item beatles[5] is not included)



In [35]:

    
beatles[5]









    Out[35]:





'Liam'
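A few more slices, sketched on a fresh four-item list so that the results are easy to check by hand. Leaving out the start or the end of the range is standard Python slicing:

```python
beatles = ["John", "Paul", "George", "Ringo"]

# From index 1 up to (but not including) index 3.
middle = beatles[1:3]

# Omitting the start means "from the beginning".
first_two = beatles[:2]

# Omitting the end means "up to the last item".
from_george = beatles[2:]

print(middle)
print(first_two)
print(from_george)
```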

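What happens if you specify a higher index? As long as it is smaller than the length of the list (10 items at this point), you simply get the item at that position: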



In [36]:

    
beatles[7]









    Out[36]:





'George'

How can you know how long a list is?



In [37]:

    
len(beatles)









    Out[37]:





10

Do remember that indexing starts at 0, so don't make the mistake of thinking that len(yourlist) will give you the last item of your list!



In [38]:

    
beatles[len(beatles)]









    



---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-38-00884e355894> in <module>()
----> 1 beatles[len(beatles)]

IndexError: list index out of range

This will work!



In [39]:

    
beatles[len(beatles) -1]









    Out[39]:





'randomname'
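As a short sketch (again on a fresh four-item list), the negative indexes seen earlier give the same result with less typing:

```python
beatles = ["John", "Paul", "George", "Ringo"]

# Two equivalent ways to get the last item.
last_by_length = beatles[len(beatles) - 1]
last_by_negative = beatles[-1]

# Negative indexes count backwards from the end.
second_to_last = beatles[-2]

print(last_by_length, last_by_negative, second_to_last)
```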

If-statements

Most of the time, what you want to do when you program is to check a value and execute some operation depending on whether the value matches some condition. That's where if statements help!

In its simplest form, an if statement is a syntactic construction that checks whether a condition is met; if it is, some part of the code is executed.



In [43]:

    
bassist = "Paul McCartney"

if bassist == "Paul McCartney":
    print("Paul played bass with the Beatles!")









    



Paul played bass with the Beatles!

Mind the indentation very much! This is the essential element in the syntax of the statement



In [44]:

    
bassist = "Bill Wyman"

if bassist == "Paul McCartney":
    print("I'm part of the if statement...")
    print("Paul played bass in the Beatles!")

What happens if the condition is not met? Nothing! The indented code is not executed, because the condition is not met, so lines 4 and 5 are simply skipped.

But what happens if we de-indent line 5? Can you guess why this is what happens?
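If you want to check your guess, here is one way to sketch the de-indented version. Instead of printing directly, it collects the messages in a list called messages (a helper introduced only for this illustration), so that it is easy to see which lines actually ran:

```python
bassist = "Bill Wyman"
messages = []

if bassist == "Paul McCartney":
    # Indented: part of the if statement, skipped when the condition is not met.
    messages.append("I'm part of the if statement...")
# De-indented: no longer part of the if statement, so it always runs.
messages.append("Paul played bass in the Beatles!")

print(messages)
```

The second message is recorded even though the condition is false, because without the indent Python no longer treats that line as belonging to the if statement.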

Most of the time, we need to specify what happens if the conditions are not met



In [45]:

    
bassist = ""

if bassist == "Paul McCartney":
    print("Paul played bass in the Beatles!")
else:
    print("This guy did not play for the Beatles...")









    



This guy did not play for the Beatles...

This is the flow:

the condition in line 3 is checked
is it met?
- yes: then line 4 is executed
- no: then line 6 is executed

Or we can specify many different conditions...



In [52]:

    
bassist = "Mike"

if bassist == "Paul McCartney":
    print("Paul played bass in the Beatles!")
elif bassist == "Bill Wyman":
    print("Bill Wyman played for the Rolling Stones!")
elif bassist == "Lucia":
    print("Lucia played for Oasis")
elif bassist == "Mike":
    print("Mike played for xxx")
else:
    print("I don't know what band this guy played for...")









    



Mike played for xxx

For loops

The greatest thing about lists is that they are iterable, that is, you can loop through them. What do we do if we want to apply some line of code to each element in a list? Try with a for loop!

A for loop can be paraphrased as: "for each element named x in an iterable (e.g. a list): do some code (e.g. print the value of x)"



In [55]:

    
for b in beatles:
    print(b + " was one of the Beatles")









    



Pete Best was one of the Beatles
Pete Best was one of the Beatles
John was one of the Beatles
Liam was one of the Beatles
Liam was one of the Beatles
Liam was one of the Beatles
Paul was one of the Beatles
George was one of the Beatles
Ringo was one of the Beatles
randomname was one of the Beatles

Let's break the code down to its parts:

b: an arbitrary name that we give to the variable holding every value in the loop (it could have been any name; b is just very convenient in this case!)
beatles: the list we're iterating through
: as in the if-statements: don't forget the colon!
indent: also, don't forget to indent this code! It's the only thing that is telling Python that line 2 is part of the for loop!
line 2: the code that we want to execute for each item in the iterable

Now, let's join if statements and for loop to do something nice...



In [57]:

    
beatles = ["John", "Paul", "George", "Ringo", "Lucia"]
for b in beatles:
    if b == "Paul":
        instrument = "bass"
    elif b == "John":
        instrument = "rhythm guitar"
    elif b == "George":
        instrument = "lead guitar"
    elif b == "Ringo":
        instrument = "drum"
    elif b == "Lucia":
        instrument = "piano"
    print(b + " played " + instrument + " with the Beatles")









    



John played rhythm guitar with the Beatles
Paul played bass with the Beatles
George played lead guitar with the Beatles
Ringo played drum with the Beatles
Lucia played piano with the Beatles
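The same loop can also be sketched with a dictionary instead of a long chain of elif statements. The dictionary instruments and the list sentences are names made up for this example; the else branch covers any name that has no entry:

```python
instruments = {
    "John": "rhythm guitar",
    "Paul": "bass",
    "George": "lead guitar",
    "Ringo": "drums",
}
beatles = ["John", "Paul", "George", "Ringo", "Lucia"]

sentences = []
for b in beatles:
    if b in instruments:
        # Look up the instrument using the name as key.
        sentences.append(b + " played " + instruments[b] + " with the Beatles")
    else:
        # No entry in the dictionary for this name.
        sentences.append("I don't know what " + b + " played")

for s in sentences:
    print(s)
```

This is only one possible design: adding a new musician now means adding one key : value pair rather than another elif branch.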

Input and Output

One of the most frequent tasks that programmers do is reading data from files, and writing some of the output of their programs to a file.

In Python (as in many languages), we first need to open a file handler with the appropriate mode in order to process a file. Files can be opened in:

read mode ("r")
write mode ("w")
append mode ("a")

Let's try to read the content of one of the txt files of our Sunoikisis directory.

First, we open the file handler in read mode:



In [58]:

    
#see? we assign the file-handler to a variable, or we wouldn't be able
#to do anything with that!
f = open("NOTES.md", "r")

Note that "r" is optional: read is the default mode!

Now there are a bunch of things we can do:

read the full content in one variable with this code:

content = f.read()

read the lines in a list of lines:

lines = f.readlines()

or, which is the easiest, simply read the content one line at a time with a for loop; the f object is iterable, so this is as easy as:



In [ ]:

    
for l in f:
    print(l)

Once you're done, don't forget to close the handle:



In [60]:

    
f.close()



In [ ]:

    
#all together
f = open("NOTES.md")
for l in f:
    print(l)
f.close()

Now, there's a shortcut statement, which you'll often see and is very convenient, because it takes care of opening, closing and cleaning up the mess, in case there's some error:



In [61]:

    
with open("NOTES.md") as f:
    #mind the indent!
    for l in f:
        #double indent, of course!
        print(l)









    



Password = ~Sun0ikiS1s!2oi62oi7#



	pip install jupyter

	jupyter notebook --generate-config
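The three ways of reading mentioned above can be compared in one self-contained sketch. It first creates a small throwaway file in a temporary folder (the file name example.txt and its content are invented for the example), so that it does not depend on any file in your directory:

```python
import os
import tempfile

# Create a small throwaway file so that the example is self-contained.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.txt")
with open(path, "w") as out:
    out.write("first line\nsecond line\n")

# 1. Read the whole file into one string.
with open(path) as f:
    content = f.read()

# 2. Read the file into a list of lines (each one still ends with "\n").
with open(path) as f:
    lines = f.readlines()

# 3. Loop over the file one line at a time.
#    rstrip("\n") removes the newline at the end of each line, so that
#    print() does not leave an empty line after each one.
stripped = []
with open(path) as f:
    for line in f:
        stripped.append(line.rstrip("\n"))
        print(line.rstrip("\n"))
```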

Now, how about writing to a file? Let's try to write a simple message on a file; first, we open the handler in write mode



In [ ]:

    
out = open("test.txt", "w")



In [ ]:

    
#the file is now open; let's write something in it
out.write("Mio test This is a test!\nThis is a second line (separated with a new-line feed)")

The file has been created! Let's check this out



In [ ]:

    
#don't worry if you don't understand this code!
#We're simply listing the content of the current directory...
import os
os.listdir()

But before we can do anything (e.g. open it with your favorite text editor) you have to close the file-handler!



In [ ]:

    
out.close()

Let's look at its content



In [ ]:

    
with open("test.txt") as f:
    print(f.read())

Again, also for writing we can use a with statement, which is very handy.

But let's have a look at what happens here, so we understand a bit better why "write mode" must be used carefully!



In [ ]:

    
with open("test.txt", "w") as out:
    out.write("Oooops! new content")

Let's have a look at the content of "test.txt" now



In [ ]:

    
with open("test.txt") as f:
    print(f.read())

See? After we opened the file in "write mode" for the second time, all content of the file was erased and replaced with the new content that we wrote!!!

So keep in mind: when you open a file in "w" mode:

if it doesn't exist, a new file with that name is created
if it does exist, it is completely overwritten and all previous content is lost

If you want to write content to an existing file without losing its previous content, you have to open the file with the "a" mode:



In [69]:

    
with open("test.txt", "a") as out:
    out.write('''\nAnd this is some additional content.
The new content is appended at the bottom of the existing file''')



In [70]:

    
with open("test.txt") as f:
    print(f.read())









    



Oooops! new content
And this is some additional content.
The new content is appended at the bottom of the existing file
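To see the difference between the "w" and "a" modes side by side, here is a minimal sketch that works on a throwaway file in a temporary folder (the file name modes.txt is hypothetical), so nothing in your working directory is touched:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")

# "w" creates the file, because it does not exist yet.
with open(path, "w") as out:
    out.write("original\n")

# "w" again: the previous content is erased and replaced.
with open(path, "w") as out:
    out.write("replaced\n")

# "a": the new content is added after the existing content.
with open(path, "a") as out:
    out.write("appended\n")

with open(path) as f:
    final_content = f.read()
print(final_content)
```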

Functions

Above, we have opened a file several times to inspect its content. Each time, we had to type the same code over and over. This is the typical case where you would like to save some typing (and write code that is much easier to maintain!) by defining a function.

A function is a block of reusable code that can be invoked to perform a definite task. Most often (but not necessarily), it accepts one or more arguments and returns a certain value.

We have already seen one of the built-in functions of Python: print("some str")

But it's actually very easy to define your own. Let's define the function to print out the file content, as we said before. Note that this function takes one argument (the file name) and prints out some text, but doesn't return any value.



In [71]:

    
def printFileContent(file_name):
    #the function takes one argument: file_name
    with open(file_name) as f:
        print(f.read())

As usual, mind the indent!

file_name (line 1) is the placeholder that we use in the function for any argument that we want to pass to the function in our real-life reuse of the code.

Now, if we want to use our function we simply call it with the file name that we want to print out



In [73]:

    
printFileContent("README.md")









    



# SunoikisisDC_NER
The materials for the SunoikisisDC sessions on Named Entity Extraction (2016/2017)

Now, let's see an example of a function that returns some value to the user. Such functions typically take some arguments, process them and yield back the result of this processing.

Here's the easiest example possible: a function that takes two numbers as arguments, sums them and returns the result.



In [75]:

    
def sumTwoNumbers(first_int, second_int):
    s = first_int + second_int
    return s



In [ ]:

    
#could be even shorter:
def sumTwoNumbers(first_int, second_int):
    return first_int + second_int



In [76]:

    
sumTwoNumbers(5, 6)









    Out[76]:





11

Most often, you want to assign the result returned to a variable, so that you can go on working with the results...



In [78]:

    
s = sumTwoNumbers(5,6)
s * 2









    Out[78]:





22
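Closer to our work with texts, here is a sketch of another function that returns a value. The function countWords is a made-up example, not part of any library; it relies on the built-in string method split(), which breaks a string on whitespace:

```python
def countWords(text):
    # split() with no arguments breaks the text wherever there is whitespace
    words = text.split()
    # return the number of pieces, so the caller can keep working with it
    return len(words)

n = countWords("I am the α and the ω!")
print(n)

# The returned value can be used in further operations.
total = n + countWords("Paul played bass")
print(total)
```

Note that this is a very rough definition of "word": punctuation attached to a word (as in "ω!") is counted as part of it.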

Errors and exceptions

Things can go wrong, especially when you're a beginner. But don't panic! Errors and exceptions are actually a good thing! Python gives you detailed reports about what is wrong, so read them carefully and try to figure out what is not right.

Once you get better, you'll actually learn that you can do something good with exceptions: you'll learn how to handle them, and to anticipate some of the most common problems that dirty data can confront you with...

Now, what happens if you forget the all-important syntactic constraint of the code indent?



In [80]:

    
if 1 > 0:
    print("Well, we know that 1 is bigger than 0!")









    



Well, we know that 1 is bigger than 0!

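The cell above is indented correctly, so it runs without complaint. But if you remove the indentation before print in line 2, Python refuses to run the code and reports an IndentationError. What you get is a syntax error: a construct that is not grammatical in Python's syntax. Note that you're also told where (at what line, and at what point of the code) your error is occurring. That is not always perfect (there are cases where the problem actually occurs before the point that Python reports), but in this case it's pretty OK.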
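One way to see the error for yourself without breaking a notebook cell is sketched below. The badly indented code is stored in a string and handed to the built-in compile() function, which checks its syntax; the try/except block is there only to catch the error and show its name. The variable names are illustrative.

```python
# The same two lines of code, but with the indent in line 2 missing.
source = 'if 1 > 0:\nprint("Well, we know that 1 is bigger than 0!")\n'

try:
    # compile() checks the syntax of the code without running it.
    compile(source, "<example>", "exec")
    error_name = None
except IndentationError as error:
    # IndentationError is a special kind of SyntaxError.
    error_name = type(error).__name__
    print(error_name, "reported at line", error.lineno)
```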

What if you forget to define a variable (or you misspell the name of a variable)?



In [84]:

    
var = "bla bla"
if var1:
    print("If you see me, then I was defined...")









    



---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-84-f4ea1c51d2f6> in <module>()
      1 var = "bla bla"
----> 2 if var1:
      3     print("If you see me, then I was defined...")

NameError: name 'var1' is not defined

You get an exception! The syntax of your code is right, but the execution met with a problem that caused the program to stop.

Now, in your program, you can handle selected exceptions: this means that you can write your code in such a way that the program keeps running even if a certain exception is raised.

Let's see what happens if we use our function to try to print the content of a file that doesn't exist:



In [85]:

    
printFileContent("file_that_is_not_there.txt")









    



---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-85-501b445822cd> in <module>()
----> 1 printFileContent("file_that_is_not_there.txt")

<ipython-input-71-09207a2f323d> in printFileContent(file_name)
      1 def printFileContent(file_name):
      2     #the function takes one argument: file_name
----> 3     with open(file_name) as f:
      4         print(f.read())

FileNotFoundError: [Errno 2] No such file or directory: 'file_that_is_not_there.txt'

We get a FileNotFoundError! Now, let's re-write the function so that this event (somebody uses the function with a wrong file name) is taken care of...



In [86]:

    
def printFileContent(file_name):
    #the function takes one argument: file_name
    try:
        with open(file_name) as f:
            print(f.read())
    except FileNotFoundError:
        print("The file does not exist.\nNevertheless, I do like you, and I will print something to you anyway...")



In [87]:

    
printFileContent("file_that_doesnt_exist.txt")









    



The file does not exist.
Nevertheless, I do like you, and I will print something to you anyway...
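A common variation, sketched here under the assumption that we would rather get the text back than print it, is a function that returns the content and signals a missing file with None. The name readFileContent is hypothetical, and the path points into an empty temporary folder so that the file is guaranteed not to exist:

```python
import os
import tempfile

def readFileContent(file_name):
    # hypothetical variant of printFileContent: returns the text instead of printing it
    try:
        with open(file_name) as f:
            return f.read()
    except FileNotFoundError:
        # signal the problem to the caller without stopping the program
        return None

# A path inside a brand-new, empty temporary folder: the file cannot exist.
missing_path = os.path.join(tempfile.mkdtemp(), "file_that_is_not_there.txt")

content = readFileContent(missing_path)
if content is None:
    print("The file does not exist.")
else:
    print(content)
```

Returning None leaves the decision about what to do next (print a message, try another file, stop) to the code that called the function.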

Appendix: useful links

Python: how to install

If you're using Mac OS X or Linux, you already have (at least one version of) Python installed. Anyway, it's very easy to install Python or upgrade your version. See:

https://wiki.python.org/moin/BeginnersGuide/Download

Jupyter: how to install

http://jupyter.org/install.html

Python and Jupyter also come in a pre-packaged environment (designed especially for data science) called Anaconda. You might be interested in looking at that.

Python 2 or Python 3?

Python 3 is the latest version of Python (currently, 3.6.1). It's a major upgrade from Python 2, but the language changed quite substantially in the passage from 2 to 3, and there are some backward-compatibility problems. Some versions of Linux or Mac OS X still come with Python 2.7 (the final version of Python 2).

Anyway, Python 3 is currently in active development: it's where the cutting-edge improvements and new stuff are being developed (especially for NLP and the NLTK library). In this code, we assume Python 3!

https://wiki.python.org/moin/Python2orPython3

NLTK: Book

Would you like a book that is a great introduction to Python for absolute beginners, is a wonderful resource to learn the basics of Natural Language Processing and gives you a thorough introduction to the NLTK library to do NLP in Python? Oh, yeah, I was forgetting: one that can be read for free on the internet? Yes, it's Christmas time!

http://www.nltk.org/book/