Please name your IPython notebook using the following convention: ASSIGNMENT_3a_FIRSTNAME_LASTNAME.ipynb
Please submit your assignment (the notebooks of parts 1 and 2 + additional files) as a single .zip file using this Google form.
If you have questions about this topic, please refer to the forum on the Canvas site.
In this block, we covered a lot of ground:
In this assignment, you will first complete a number of small exercises about each chapter to make sure you are familiar with the most important concepts. In the second part of the assignment, you will apply your newly acquired skills to write your very own text processing program (ASSIGNMENT-3b) :-). But don't worry, there will be instructions and hints along the way.
In the first part of this assignment, you will revise some of the basic notions we covered in the previous chapters. Most of the exercises can be completed rather quickly. If you get stuck, you should be able to complete them by going back to the chapters. The purpose of this part is to give you some practice and confidence so you are all set and ready to move on to part 2 of the assignment - processing and analyzing some text!
Define a function called split_sort_text which takes a string as input, splits it at space characters and returns all the unique words in the string in alphabetical order.
In [ ]:
# your code here
NLTK offers a way of using WordNet in Python. Do some research (using Google, because quite frankly, that's what we do very often) and see if you can find out how to import it. WordNet is a computational lexicon which organizes words according to their senses (collected in synsets). See if you can print all the synsets (i.e. entries) of the word 'dog'.
In [ ]:
# your code here
Define a function called count which counts the words in a string. Do not use NLTK just yet. Find a way to test it.
Hint 1: Write a helper function called preprocess which preprocesses the string (split it, remove punctuation, return it in a container that you think works best for the next steps).
Hint 2: Remember that there are string methods which you can use to get rid of unwanted characters. Test the preprocess function using the string 'this is a (tricky) test'.
Hint 3: Remember how we used dictionaries to count words? If not, have a look at the containers chapter.
Hint 4: Test your function using an example string which will tell you whether it fulfills the requirements (remove punctuation, split, count). You will get a point for good testing.
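As a reminder of the two building blocks mentioned in the hints, here is a minimal sketch. It is not the solution to the exercise: the token and the list of colours are made-up example data, and you may well choose a different approach.

```python
import string

# Hint 2 reminder: str.strip removes the given characters from both ends only.
token = "(hello),"
cleaned = token.strip(string.punctuation)
print(cleaned)

# Hint 3 reminder: a dictionary can map each item to the number of times it occurs.
colours = ["red", "blue", "red", "green", "blue", "red"]  # made-up example data
colour_counts = {}
for colour in colours:
    # dict.get returns 0 the first time a colour is seen
    colour_counts[colour] = colour_counts.get(colour, 0) + 1
print(colour_counts)
```

Note that str.strip leaves punctuation in the middle of a string untouched, so think about whether that is good enough for your preprocess function.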
Use your editor to create a Python script called word_counts.py. Move your code into the Python script and add a function call. Move your helper function to a separate script which you call utils.py. Import your helper function into word_counts.py. Test whether everything works as expected by calling the script word_counts.py from the terminal. Include your tests in the word_counts.py script.
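If you are unsure what such a script could look like, here is a minimal sketch of one possible layout. The file name greet.py and the function greet are hypothetical and only serve as an illustration; the `if __name__ == "__main__":` guard is one common way to separate function definitions from the calls that test them, not a requirement of this exercise.

```python
# Contents of a hypothetical script, e.g. greet.py


def greet(name):
    """Return a short greeting for name."""
    return f"Hello, {name}!"


if __name__ == "__main__":
    # This part only runs when the file is executed as a script
    # (python greet.py), not when it is imported from another file.
    assert greet("Ada") == "Hello, Ada!"
    print(greet("Ada"))
```

Another script in the same folder could then use the function via `from greet import greet`.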
Please submit your scripts together with this notebook in a single folder and upload the entire folder to the google form.
Don't forget to add docstrings to your functions.
In [ ]:
# Feel free to use this cell to try out your code.
Playing with lyrics
a.) Write a function called load_text which opens and reads a file and returns the text in the file. It should take a filepath as a parameter. Test it by loading this file: ../Data/lyrics/walrus.txt
b.) Write a function called replace_walrus which takes lyrics as input and replaces every instance of 'walrus' by 'hippo' (make sure to account for upper and lower case - it is fine to transform everything to lower case). The function should write the new version of the song to a file called 'walrus_hippo.txt', stored in ../Data/lyrics.
Don't forget to add docstrings to your functions.
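As a refresher on reading and writing files, here is a minimal sketch that uses a temporary directory, so it does not touch the files of this exercise. The file name note.txt and the example sentence are made up for illustration.

```python
import os
import tempfile

# Work in a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "note.txt")  # hypothetical file name

    # Mode "w" opens the file for writing.
    with open(path, "w", encoding="utf-8") as outfile:
        outfile.write("Good Morning, good night\n")

    # Mode "r" (the default) opens the file for reading.
    with open(path, "r", encoding="utf-8") as infile:
        text = infile.read()

# Lowercasing first makes the replacement insensitive to case.
changed = text.lower().replace("good", "great")
print(changed)
```

Using `with` ensures that each file is closed again as soon as the indented block ends.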
In [ ]:
# your code here
Building a simple NLP pipeline
For this exercise, you will need NLTK. Don't forget to import it.
Write a function called tag_text which takes raw text as input and returns the tagged text. To do this, make sure you follow the steps below:
Tokenize the text.
Perform part-of-speech tagging on the list of tokens.
Return the tagged text
Then test your function using the text snippet below (test_text) as input.
In [ ]:
test_text = """Two households, both alike in dignity,
In fair Verona, where we lay our scene,
From ancient grudge break to new mutiny,
Where civil blood makes civil hands unclean."""
In [ ]:
# your code here
[answer]
In [ ]:
# your code here
6.b) What is the difference between the modes 'w' and 'a' when opening a file?