Input for your programs often comes from files on your disk, such as 'corpora' (in linguistics, a 'corpus' is a large collection of digital text). Likewise, you often want output to be written back to files on your disk as well. Reading and writing files is therefore an essential part of programming and, luckily for us, this is really simple in Python. The following example reads a file from disk:
In [ ]:
f = open('data/austen-emma-excerpt.txt', 'rt', encoding='utf-8')
text = f.read()
f.close()
print(text)
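Watch out: the open() function doesn't return the actual text that is saved in the text file. It only returns a so-called 'file object' from which we can read the content using the .read() function. We passed three arguments to the open() function:
- the path to the file that we want to open;
- the mode in which to open it, here 'rt' ('r' for read mode; 't' for text mode);
- the encoding of the file, here 'utf-8'.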
UTF-8
You may wonder what an encoding is and what UTF-8 is. For anyone working with texts and computers this is vital to know. Internally, a computer knows no characters whatsoever: every piece of information is represented as numbers (which in turn are represented in a binary format, as zeroes and ones). An encoding specifies which numbers represent which characters. A famous and long-standing encoding scheme is ASCII, in which for example the letter 'A' is encoded using the number 65. ASCII, however, only has a very limited alphabet and cannot encode a lot of writing systems. A modern-day standard supporting countless writing systems is Unicode, and UTF-8 is one of the standard ways of encoding Unicode. This is the type of encoding that you will want to use for your data whenever possible. Whenever you have a choice, you should use Unicode!
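To make the idea of 'characters as numbers' a bit more tangible, here is a minimal sketch using the built-in functions ord() and chr() and the string method .encode(). The example word is our own choice and is not taken from the course data.

```python
# Every character corresponds to a number (its 'code point').
letter = 'A'
code_point = ord(letter)         # 65, the same number that ASCII uses
same_letter = chr(code_point)    # and back from the number to 'A'

# Encoding turns text into bytes; decoding turns bytes back into text.
word = 'caf\u00e9'               # the word 'café', written with an escape for 'é'
utf8_bytes = word.encode('utf-8')
roundtrip = utf8_bytes.decode('utf-8')

print(code_point, same_letter)
print(len(word), len(utf8_bytes))   # 4 characters, but 5 bytes: 'é' needs two
print(roundtrip == word)

# ASCII has no number for 'é', so encoding this word as ASCII fails.
try:
    word.encode('ascii')
except UnicodeEncodeError as error:
    print('ASCII cannot encode this word:', error.reason)
```

Note that the number of characters and the number of bytes differ as soon as a text contains characters outside the ASCII range.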
Reading an entire file into one string is not always desirable, especially not with huge files. The following example reads up to the next newline every time, and so returns one line at a time.
In [ ]:
f = open('data/austen-emma-excerpt.txt','rt', encoding='utf-8')
for line in f:
    print(line)
f.close()
The 'newline' character is probably something new to you. If you are dealing with plain text files (typically files whose name ends in the '.txt' extension), your machine uses a special character to signal that a new line should begin. Internally, such newlines are represented as "\n". Normally, this character is visualized on your screen as if the enter key were pressed. See what happens below:
In [ ]:
s = "This is the first line.\nThis is the second line."
print(s)
There exists a similar character to encode 'tab' characters, namely \t. You can use this character to play around with the indentation of your (e.g. hierarchically structured) output:
In [ ]:
s = "First line\n\t* Second line\n\t* Third line\n\t* Fourth line\nFifth line"
print(s)
In the code block above in which you read the Austen file, the newline is still included with the original line that preceded it in the file: this is why you see all the extra empty lines in the output above! If you wish to remove all leading and trailing whitespace in a string (newlines, spaces, but also tabs), you can use the strip() function:
In [ ]:
s = " strip me! "
print(s.strip())
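The same works for tabs and newlines. As a small sketch, the string below is a made-up example of what a line read from a file might look like. The built-in repr() function is used to show the special characters explicitly instead of rendering them on screen.

```python
# A made-up line: a leading tab, some text, and a trailing newline.
raw_line = "\tstrip me too!\n"

stripped = raw_line.strip()

# repr() displays \t and \n as escape sequences, so you can see them.
print(repr(raw_line))
print(repr(stripped))
```

Whitespace inside the string (here, the spaces between the words) is left untouched: only the beginning and the end of the string are cleaned up.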
Now, try to adapt the code that read in the Austen file and have your code print each line without the preceding and trailing whitespace! Have the annoying "double lines" disappeared in your output?
In [ ]:
f = open('data/austen-emma-excerpt.txt','rt', encoding='utf-8')
for line in f:
    print(line)
f.close()
Rather than just printing, we can of course do whatever we want with this file's content. Let's count the number of lines (but note that a line does not necessarily correspond to a sentence).
In [ ]:
count = 0
f = open('data/austen-emma-excerpt.txt', 'rt', encoding='utf-8')
for line in f:
    count += 1
f.close()
print(count)
Read the file data/austen-emma-excerpt.txt and compute the average length (in characters) of the lines.
In [ ]:
f = open('data/austen-emma-excerpt.txt', 'rt', encoding='utf-8')
# insert your code here
# important: always remember to properly close your files again!
Now that we have mastered the art of reading files, let's move on to writing files, which follows a similar logic:
In [ ]:
f = open('data/testoutput.txt', 'wt', encoding='utf-8')
f.write("Hello world!")
f.close()
In this code block, we have automatically created a new file called testoutput.txt in the data directory. We then wrote a single line to this file and closed it. Note that the w in wt is a crucial addition: if you had left the mode out altogether, Python would have opened the file in read-only mode and you wouldn't have been able to write to it! The 't' in the argument, again, signifies that we will be writing to this file in plain text mode.
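The difference between the two modes can be checked with a small experiment. The following is a minimal sketch: it works in a temporary directory created with the standard tempfile module (our own choice, so that nothing in your data directory is touched) and it uses try/except to catch the error that Python raises when you write to a file opened for reading.

```python
import io
import os
import shutil
import tempfile

# Work in a throwaway directory so that no existing file is touched.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'testoutput.txt')

# 'wt' creates the file (or empties it if it already exists).
f = open(path, 'wt', encoding='utf-8')
f.write("Hello world!")
f.close()

# 'rt' opens the same file for reading only, so writing fails.
f = open(path, 'rt', encoding='utf-8')
try:
    f.write("This will not work.")
    write_failed = False
except io.UnsupportedOperation:
    write_failed = True
content = f.read()   # reading, on the other hand, works fine
f.close()

shutil.rmtree(tmp_dir)   # clean up the throwaway directory

print(write_failed)
print(content)
```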
If you want your data to be written on multiple lines, you need to take care to explicitly encode the newlines. Instead of:
In [ ]:
f = open('data/testoutput.txt','wt', encoding='utf-8')
f.write("Hello world on the first line!")
f.write("Hello world on the second line!")
f.close()
You need to write:
In [ ]:
f = open('data/testoutput.txt','wt', encoding='utf-8')
f.write("Hello world on the first line!\n")
f.write("Hello world on the second line!")
f.close()
Otherwise your file would have Hello world on the first line!Hello world on the second line! in it, i.e. without the newline.
Besides 'read-mode' and 'write-mode' when dealing with text files, there is also the 'append-mode' in Python. Watch out: in 'write-mode', you will always overwrite the existing content of the file. However, if you have opened a file in 'append-mode', everything you write to the file will be added at the end of the file, without deleting anything of the existing content in the file. In order to enable the append mode, you need to specify 'at' as your second parameter when you open files ('a' for append mode; 't' for text mode).
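Here is a minimal sketch that contrasts the two modes. The file name log.txt and the temporary directory are assumptions made for this illustration only. They are not part of the course data.

```python
import os
import shutil
import tempfile

# Work in a throwaway directory so that no existing file is touched.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'log.txt')

# Write-mode: create the file and write a first line.
f = open(path, 'wt', encoding='utf-8')
f.write("first line\n")
f.close()

# Append-mode: the new text is added after the existing content.
f = open(path, 'at', encoding='utf-8')
f.write("second line\n")
f.close()

f = open(path, 'rt', encoding='utf-8')
after_append = f.read()
f.close()

# Write-mode again: the existing content is thrown away first.
f = open(path, 'wt', encoding='utf-8')
f.write("third line\n")
f.close()

f = open(path, 'rt', encoding='utf-8')
after_overwrite = f.read()
f.close()

shutil.rmtree(tmp_dir)   # clean up the throwaway directory

print(repr(after_append))
print(repr(after_overwrite))
```

After appending, the file holds both lines. After reopening it in write-mode, only the last line that was written survives.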
Read the file data/austen-emma-excerpt-tokenised.txt, and write to a file words.txt all words occurring in this text (without duplicates!), alphabetically ordered, one word per line. That way, you are really creating a lexicon or word list of the text. (Tip: you should use sets in this exercise!)
In [ ]:
# insert your code here
Check your output by viewing the words.txt file in a text editor such as Sublime Text 2. (Windows users: do not use Notepad!)
Most of the functionality we have used thus far is simply built into the Python language itself. Often, however, you need to use external modules in your code. A lot of such modules are already available in the Python Standard Library, for a wide variety of tasks. There are also countless third-party providers of Python modules.
To use an external module in your code you need to explicitly 'import' it. Consider for example the module random from the standard library, which contains functions for generating random numbers:
In [ ]:
import random
print(random.randint(0, 10))
Note the syntax used: using the dot, we indicate that our machine should look for the randint() function inside the random module we just imported. You can import an entire module or import only a specific function in the module. We could also have imported the (single) function we needed as follows:
In [ ]:
from random import randint
print(randint(0, 100))
In this case we wouldn't have to specify where our machine should find the randint() function. You can also (temporarily) change the names of the functions you import:
In [ ]:
from random import randint as random_int
print(random_int(0, 100))
So far, you have already seen lots of functions. In general, a function will do something for you, based on a number of input parameters you pass it, and it will typically return a result. You are not limited to using functions available in the standard library or the ones provided by external parties. You can also write your own functions!
In fact, you must write your own functions. Separating your problem into sub-problems and writing a function for each of those is an immensely important part of well-structured programming. In the rest of this chapter we will teach you how to write your own functions and how to import and re-use these functions in other, new scripts you write!
Let's start off with a trivial function. Functions are defined using the def keyword, followed by the name you want your function to have and (optionally!) the names of the parameters that your function takes.
def some_name(optional_parameters):
    # here goes your functionality
    return my_result
The return statement returns a value back to the caller and always ends the execution of the function. Mind the indentation here, which is how we make clear to the Python interpreter which lines belong to our function.
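That return ends the execution of a function can be seen in the following small sketch. The function describe() is a made-up example, not part of any library.

```python
def describe(number):
    if number < 0:
        # The function stops right here for negative numbers...
        return "negative"
    # ...so this line is only reached when the return above was not executed.
    return "zero or positive"

print(describe(-3))
print(describe(4))
```

Because the first return already leaves the function, there is no need for an else branch here.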
In [ ]:
def multiply(x, y):
    result = x * y
    return result

# or in shorthand
def multiply(x, y):
    return x*y

# now that you defined this function, you can use it in the rest of your code:
z = multiply(2, 5)
print(z)
Now, let's define a more advanced function. The following function count_articles() counts the number of articles (the, a, an) in a list of words. The words are passed to the function as a list of strings. Note that the function itself lowercases the words, so that you never have to take care of this again in the rest of your code! Can you change count_articles() in such a manner that it accepts the entire string and splits it into tokens internally? That way, the user of the function wouldn't have to care about this!
In [ ]:
def count_articles(tokens):
    count = 0
    for token in tokens:
        if token.lower() in ('the', 'a', 'an'):
            count += 1
    return count

text = "A bit less trivial , this function counts the number of articles ( the , a , an ) in a list of words"
words = text.split()
print(count_articles(words))
We can also have a function return multiple values by combining them in a tuple (the type of ordered, immutable list we talked about in the previous chapter!). Python has a nice way of 'unpacking' such a tuple using assignment:
In [ ]:
def count(text):
    words = text.split(" ")
    word_count = len(words)
    character_count = len(text)
    return word_count, character_count

word_count, character_count = count("To be or not to be , that is the question .")
print(word_count)
print(character_count)
Note that your functions don't have to return anything explicitly. If you run the example below, you will see that such a 'void' function in reality returns 'None'.
In [ ]:
def little_grey(doctor):
    print("Oh Dr", doctor, "!")

whatever = little_grey("Steamy")
print(whatever)
Any variables you declare in a function, as well as the parameters that are passed to a function, will only exist within the 'scope' of that function, i.e. inside the function itself. The following code will produce an error, because the variable x does not exist outside the function:
In [ ]:
def setx():
    x = 1

setx()
print(x)
Also consider this:
In [ ]:
x = 0

def setx():
    x = 1

setx()
print(x)
In fact, this code has produced two completely unrelated x's!
Nevertheless, it is possible to read a global variable from within a function, in a strictly read-only fashion (as soon as you assign to the name inside the function, you create a new local variable instead, as in the example above):
In [ ]:
x = 1
def getx():
    print(x)

getx()
Write a function calculate_average(numbers), taking one parameter, a list (or other iterable) containing only numbers (you may assume that), and returning the average/mean:
In [ ]:
# write your function calculate_average here
# do not modify the code below, for testing only
numbers = [1, 2, 3, 4, 5]
av = calculate_average(numbers)
print(av == 3) # should be True if you did it correctly
Write a function textstats(filename) that, given the name of a file to load, returns a three-tuple with the following general statistics on the file: number of lines, number of words/tokens, number of characters. Don't worry about tokenisation: you can just split the words along whitespace.
In [ ]:
# write your function textstats here
Up until now, we have been using the interactive IPython software to write our Python code. We have only been writing really small bits and pieces of code, however, instead of writing longer scripts that can provide a more significant batch of functionality. Now, let us make our first independent Python script together. (Note that this way of working will also resemble more closely your future day-to-day coding practice.)
Open 'Sublime Text 2', a popular text editor which we will use in this course (http://www.sublimetext.com/). Create a new file -- this might have happened automatically when you opened the editor -- and save it as "script.py" in a convenient location (here, we will assume that you have saved it in your Desktop folder). Note that files containing Python code typically take the ".py" extension.
If you are working in a UNIX-like environment (Mac or Linux), you should now add the following code on the very first line of your script:
In [ ]:
#!/usr/bin/env python
This line will tell your computer which language you want to use to run the script -- in this case, our default installation of Python 3 will be used. In technical terms, the "#!" is called a "shebang" indication. If you are working in MS Windows, you can add this 'shebang line' as well, but it will have no effect.
Now, let us add a simple Python function to this file. The fib() function will print the first numbers in the famous Fibonacci series. The function will only print the items in the series that are smaller than upper, i.e. the parameter we pass to this function:
In [ ]:
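def fib(upper):
    # write Fibonacci series up to upper
    "Print a Fibonacci series up to upper"
    a, b = 0, 1
    while b < upper:
        print(b)
        # update both values at once, so that the old value of b
        # is still available when the new one is computed
        a, b = b, a + b
    return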
Next, add a line that actually calls the fib() function for upper=2000. (Don't forget to take care of the correct indentation!)
In [ ]:
fib(2000)
Instead of executing this code by hitting ctrl+enter as we have done in the IPython notebook so far, we will now learn how to execute our code differently. We have two ways for doing this: an easy one, and a difficult one. When you work with a code editor like Sublime Text, there is often an easy way to execute your code. In Sublime, for instance, you can first save your file with a ".py" extension, and then 'build' your code by hitting Ctrl+b (Windows, Unix) or Command+b (Mac OS X). You will now see that the output of your script will be written to your screen in Sublime.
For the second option, you can use a command line interface or prompt. Watch out: this can be pretty scary at first... In general, you should always watch out when you use a command line interface to your machine: only execute commands that you (more or less) understand! You typically have complete control over your machine via such an interface, so you need to watch out not to remove any important files. (You could e.g. unintentionally delete your entire operating system from your hard drive with a single command...).
First we will deal with instructions for doing this in Mac OS X and Linux-distributions such as Ubuntu. Mac OS X and Linux tend to behave similarly because they are both 'Unix-based' operating systems.
On Ubuntu, for instance, you can press Ctrl+Alt+T to open a console window. Now 'cd' (= Change Directory) into the directory that contains your script.py file (in our case that would mean: cd ~/Desktop). Next, execute our script by typing: python3 script.py. With this command, we explicitly tell the machine to execute this script using python3 (that is, the default version of Python 3 installed on your machine). Normally, the output of the print() calls should now have been sent to your console window. Has it?
However, because we added the 'shebang line' at the top of script.py, we could have also used the plainer command ./script.py (which simply means: 'run this program'). To make the program fully executable you might have to "chmod" it first (CHange file MODes), using the command chmod +x script.py. With this command you tell your machine that it is safe to execute this script. For additional info on these commands and the options they take, you can always run the man command (e.g. man chmod). A good tutorial that covers the basics of the bash command line interface on Unix-like operating systems is: http://praxis.scholarslab.org/tutorials/bash/.
Under a Windows operating system, you can simply double-click the script.py file: because of the .py extension your OS will automatically run the script via the Python interpreter (note that the shebang line in your script.py is in reality ignored). Alternatively, go to Start > All programs > Accessories and click on Command Prompt. In the Search or Run line, you can also type cmd and press enter. This will open a DOS console window.
Navigate to the folder that holds your script.py: in our case you could do this via the command: cd C:\documents and settings\your_username\desktop or simply cd desktop. Next, execute our script by typing: python3 script.py. With this command, we explicitly tell the machine to execute this script using python3 (that is, its default version on your machine). Normally, the output of the print() calls should now have been sent to your console window. Has it? A good tutorial that covers the basics of the command line interface in DOS-based operating systems is: http://www.computerhope.com/issues/chusedos.htm
You can now also import the functionality from script.py in other scripts! Remove (or comment out with a hash sign, #) the following line, containing the actual function call, from script.py:
In [ ]:
fib(2000)
Create a new file called main.py in the same directory, namely your Desktop folder. Add the shebang line on top, as well as the following statements, which will import the functionality from the script.py module. Note that the syntax is entirely the same as for importing one of the 'official' modules from the Python Standard Library! Instead of running script.py, now try to run main.py, which will import the script module and call its fib() function. Does this work out? You don't have to add the '.py' by the way: your computer will figure out this extension itself.
In [ ]:
#!/usr/bin/env python
import script
script.fib(upper=2000)
Note that we have to be explicit about where our Python interpreter should look for the fib() function using the syntax with a dot (module.function()). If we want to be able to use the shorter version of the function call, we should have used the following import statement:
In [ ]:
from script import *
Can you now try to run main.py again, but now with the shorter call fib(upper=2000), without explicitly mentioning the module from which the function originates? Does that work out for you?
Now check out the files in your Desktop folder: you will notice that something new has been created, namely a folder called __pycache__ that holds a file whose name starts with script and ends in .pyc. (Older versions of Python instead wrote a file script.pyc right next to script.py.) If you can't see it, note that you can explicitly list all files in the current directory using the ls command on Unix-like operating systems, or the dir command in the Windows Command Prompt. The extension of this new file stands for 'Compiled Python File'. Don't worry about this file -- you won't be able to inspect its contents using a text editor anyway. It contains the numerical 'bytecode' that the Python interpreter executes: it is this machine-readable version of your code that has actually been imported into the main.py module. You can safely ignore these files, but now you know what they are for. (By the way: note that no .pyc file has been created for main.py, because no functionality from this file has been imported into another module.)
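Where exactly Python stores the compiled file depends on the version and configuration of Python you use. As a minimal sketch, assuming a standard Python 3 installation, you can ask the standard-library function importlib.util.cache_from_source() where the compiled version of a given source file would be stored. The function only computes a name, so script.py does not even have to exist for this to work.

```python
import importlib.util

# Ask Python where it would store the compiled version of script.py.
cached_path = importlib.util.cache_from_source('script.py')

print(cached_path)
```

On a typical installation the printed path points into a __pycache__ folder and includes the Python version in the file name.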
It's always a good idea to distribute your code over a set of modules. In technical terms, your code should be as 'modular' as possible, meaning that similar functions should be grouped into the same module. This will help you keep your code organized, especially when you are working on a larger project. If you have a set of functions that you use for loading and parsing files in Python, why not group them under the same module? This way you can organize your own coding more efficiently, as well as share and document your code more easily.
Now you know how to store and organize your code in separate files and modules. Still, a lot of programmers continue to explore their data using 'interactive Python' via a so-called 'interactive Python interpreter' that more or less resembles the IPython environment you have been working in so far. To launch such an interpreter, just type in python3 in your console and hit enter. This will launch a live Python session in your command line console where you can experiment with your data by typing in commands, much like you have done in the IPython environment so far. Try this out! Just type in lines of Python code after the >>> prompt and hit enter to execute them immediately.
When you make the exercises below, don't write your code in the IPython notebook anymore but write it in a separate file and run it from the command line!
Inspired by Think Python by Allen B. Downey (http://thinkpython.com), Introduction to Programming Using Python by Y. Liang (Pearson, 2013). Some exercises below have been taken from: http://www.ling.gu.se/~lager/python_exercises.html.
[...] frequencies.txt. Make sure your program ignores capitalization as well as punctuation (hint: check out string.punctuation online!). Search the web in order to find out how you can sort a dictionary -- this is not easy, because you might have to import another module.
[...] replace() function for this.) Write the new version of the novel to a file called starring_me.txt.
[...] string.punctuation online!). Try out the function on your Gutenberg book.
[...] .py), create two additional functions: one that spots 'hapaxes dislegomena' (words occurring only twice) and one that spots 'hapaxes trislegomena' (words occurring only three times) in a text file. Now import these functions in another, standalone script and call all three functions from there. Again, try them out on your Gutenberg-file.
[...] random.randint().
[...]
- Periods followed by whitespace followed by a lowercase letter are not sentence boundaries.
- Periods followed by a digit with no intervening whitespace are not sentence boundaries.
- Periods followed by whitespace and then an uppercase letter, but preceded by any of a short list of titles are not sentence boundaries. Sample titles include Mr., Mrs., Dr., and so on.
- Periods internal to a sequence of letters with no adjacent whitespace are not sentence boundaries, such as in www.aptex.com or e.g.
- Periods followed by certain kinds of punctuation (notably comma and more periods) are probably not sentence boundaries.
You might want to check out string functions, like .islower() and .isalpha(), in the official Python documentation online. Your task here is to write a function that, given the name of a text file, writes its content with each sentence on a separate line to a new file whose name is also passed as an argument to the function. The function itself should return a list of sentences. Test your program with the following short text: "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't." The result written to the new file should be:
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
You've reached the end of Chapter 5!