Please name your IPython notebook with the following naming convention: ASSIGNMENT_3b_FIRSTNAME_LASTNAME.ipynb
Please submit your assignment (notebook + additional files) together with part 1 as a single .zip file using this Google Form.
If you have questions about this topic, please refer to the forum on the Canvas site.
In this part of the assignment, we will carry out our own little text analysis project. The goal is to gain some insights into longer texts without having to read them all in detail.
This part of the assignment builds on some notions that were covered in part a of the assignment. Please feel free to go back to part a and reuse your code whenever possible.
The goals of this part are:
Tip: The assignment is split into 5 steps, each of which is divided into smaller sub-steps. Instead of doing everything step by step, we highly recommend that you read all the sub-steps of a step first and only then start coding. In many cases, the sub-steps are there to help you split the problem into manageable sub-problems, but it is still good to keep the overall goal in mind.
In the directory ../Data/books/, you should find three .txt files. If not, you can use the following cell to download them. Also, feel free to look at the code to learn how to download .txt files from the web.
We define a function called download_book, which downloads a book in .txt format and returns it as a string. Then we define a dictionary with names and URLs. We loop through the dictionary, download each book, and write it to a file in the ../Data/books/ directory. You don't need to do anything - just run the cell and the files will be downloaded to your computer.
In [ ]:
# Downloading data - you get this for free :-)
import requests
import os

def download_book(url):
    """
    Download a book given a url to a book in .txt format and return it as a string.
    """
    text_request = requests.get(url)
    text = text_request.text
    return text

book_urls = dict()
book_urls['HuckFinn'] = 'http://www.gutenberg.org/cache/epub/7100/pg7100.txt'
book_urls['Macbeth'] = 'http://www.gutenberg.org/cache/epub/1533/pg1533.txt'
book_urls['AnnaKarenina'] = 'http://www.gutenberg.org/files/1399/1399-0.txt'

# create the target directory if it does not exist yet
# (makedirs also creates missing parent directories)
os.makedirs('../Data/books/', exist_ok=True)

for name, url in book_urls.items():
    text = download_book(url)
    # specify the encoding explicitly so non-ASCII characters are written correctly
    with open('../Data/books/' + name + '.txt', 'w', encoding='utf-8') as outfile:
        outfile.write(text)
In [ ]:
# your code here
For testing code, it is always good to use a little toy example. Since we don't want to test our code on all three books, let's use a single chapter of the Huck Finn novel. Let's say we want to look at Chapter 1 for now.
To do this, we need to:
- read the book from file,
- split the text into chapters,
- select the chapter we are interested in.
Solve this first task by writing a function called get_chapter_HuckFinn, which returns the requested chapter. Define the function in a way that the user can specify which chapter should be returned. Set the default to Chapter 1. Use as many helper functions as you need and give them indicative names.
Please assign Chapter 1 of HuckFinn to a variable called test_chapter.
Hint: Please don't worry too much about splitting the chapters - as long as you make sure your solution works for most cases, it is alright for this exercise. However, it is always good to have an idea of what may go wrong and whether there is a way to avoid it.
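If you get stuck, here is one possible sketch (by no means the only solution). It assumes the file is stored at ../Data/books/HuckFinn.txt and that chapter headings contain the word 'CHAPTER'; both of these are assumptions you should check against your own files.
In [ ]:
# A rough sketch - the split marker and file path are assumptions, adjust them to your files.
def read_book(filepath):
    """Read a book from file and return it as a single string."""
    with open(filepath, encoding='utf-8') as infile:
        return infile.read()

def split_into_chapters(text, marker='CHAPTER'):
    """Split the text on the chapter marker and drop the part before the first chapter."""
    # Note: a table of contents may also contain the marker - this is one of the
    # things that can go wrong with a naive split.
    return text.split(marker)[1:]

def get_chapter_HuckFinn(chapter_number=1, filepath='../Data/books/HuckFinn.txt'):
    """Return a single chapter of Huckleberry Finn (Chapter 1 by default)."""
    chapters = split_into_chapters(read_book(filepath))
    return chapters[chapter_number - 1]

test_chapter = get_chapter_HuckFinn()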
In [ ]:
# your code here
3.a) Let's get a little bit of an overview of what we can find in a chapter. Write a function taking a string as input and printing:
Test your function using Chapter 1 of HuckFinn as a test string.
3.b) Write a modified version of the function you used to load a single chapter so that it returns all chapters. Attention: The part before the first chapter should not be included. Then iterate over the chapters and print the basic statistics collected in 3.a for each chapter. Don't forget to print the chapter number as well.
Hint: If you did not manage to solve 3.a, try to still iterate over the list of chapters and print the chapter number.
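Since the exact statistics are up to you, here is a minimal sketch that simply counts tokens, types and (very roughly) sentences; swap in whatever statistics you decided on in 3.a. The chapter marker and file path are again assumptions.
In [ ]:
# A minimal sketch; the printed statistics are only examples.
def print_text_stats(text):
    """Print some basic statistics about a text."""
    tokens = text.split()
    print('Number of tokens:', len(tokens))
    print('Number of types:', len(set(tokens)))
    print('Number of sentences (approximated by counting full stops):', text.count('.'))

def get_chapters(filepath='../Data/books/HuckFinn.txt', marker='CHAPTER'):
    """Return the book as a list of chapters, without the part before the first chapter."""
    with open(filepath, encoding='utf-8') as infile:
        text = infile.read()
    return text.split(marker)[1:]

for number, chapter in enumerate(get_chapters(), start=1):
    print('Chapter', number)
    print_text_stats(chapter)
    print()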
In [ ]:
# your code here
In [ ]:
Time to scale up: Now let's compare the three books to each other.
Preparation: Write a function called get_book which takes a filepath as input and returns the entire book as text (you have already written this code above, just turn it into an individual function in case you haven't done so already).
4.a) Modify your function called get_chapters, which loads the book as a list of chapters or acts, in a way that it works for all three books (note that Macbeth is divided into acts rather than chapters).
Hint 1: You can use the title of the book as a parameter.
4.b) Write a function called get_stats_books which takes a list of filepaths to the three books as input (we have already created this list above) and prints, for each book, the title, the number of chapters or acts, and the statistics from 3.a.
Hint 1: You can get the book title from the filepath by using the os module and/or applying some string manipulations to the filepath.
Hint 2: Use the function you already defined above for the statistics (except the number of chapters).
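One way the pieces could fit together, as a sketch only: it reuses print_text_stats and the chapter/act markers from the sketches above, both of which are assumptions rather than the required solution.
In [ ]:
# A sketch; reuses print_text_stats from above and assumes 'ACT'/'CHAPTER' as split markers.
import os

def get_book(filepath):
    """Return the entire book as a string."""
    with open(filepath, encoding='utf-8') as infile:
        return infile.read()

def get_chapters(filepath, title):
    """Return the book as a list of chapters (or acts, in the case of Macbeth)."""
    marker = 'ACT' if title == 'Macbeth' else 'CHAPTER'
    return get_book(filepath).split(marker)[1:]

def get_stats_books(filepaths):
    """Print the title, the number of chapters/acts and basic statistics for each book."""
    for filepath in filepaths:
        title = os.path.basename(filepath).replace('.txt', '')
        chapters = get_chapters(filepath, title)
        print(title)
        print('Number of chapters/acts:', len(chapters))
        print_text_stats(get_book(filepath))
        print()

book_paths = ['../Data/books/HuckFinn.txt', '../Data/books/Macbeth.txt', '../Data/books/AnnaKarenina.txt']
get_stats_books(book_paths)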
In [ ]:
# your code here
The statistics above already provide some insights, but we want to know a bit more about what the books are about. To do this, we want to get the 20 most frequent nouns of each book. Since some of us are linguists, we are not satisfied with just looking at the surface forms of the tokens. Instead, we want to lemmatize them first. In addition, we want to write our results to files, so we can analyze them later. Since we're getting really good at programming, we are going to move our functions to a python script which we can run from the terminal. Don't worry, we will do everything step by step.
Preparation: Write a function called lemmatize_nouns which takes text as input and returns a list of lemmas. Hint: We have already written a function called tag_text in part a of the assignment. Feel free to use the code to get the lemmas!
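Since tag_text from part a is not reproduced here, the following sketch uses NLTK directly (tokenizer, POS tagger and WordNet lemmatizer); if you use your tag_text function instead, the details will differ.
In [ ]:
# A sketch using NLTK directly; your tag_text function from part a may do part of this already.
# You may first need: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
# nltk.download('wordnet')
import nltk
from nltk.stem import WordNetLemmatizer

def lemmatize_nouns(text):
    """Return a list of lemmatized nouns found in the text."""
    lemmatizer = WordNetLemmatizer()
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    # Penn Treebank noun tags all start with 'NN'
    return [lemmatizer.lemmatize(token.lower(), pos='n')
            for token, tag in tagged if tag.startswith('NN')]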
5.a) Write a new function called count_lemmatized_nouns which takes text as input, counts the lemmas, and returns a list of the most frequent ones. Add a parameter so the user can specify how many should be returned. By default, it should return 20.
Hint 1: Feel free to modify the lemmatize_nouns function.
Hint 2: Feel free to use a Counter dictionary (in the collections module - google how to import and use it). You can also build your own dictionary for counting, as we did in the chapter on dictionaries.
Hint 3: Getting the most frequent nouns is really easy with the Counter dictionary in the collections module. If you are not using it, have a look at the sorted function and how you can use it on tuples.
Test your function using test_text and test_chapter defined above.
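A minimal sketch following Hints 2 and 3, reusing lemmatize_nouns from the preparation step (note that it returns (lemma, count) tuples; returning just the lemmas would also be a valid reading of the task):
In [ ]:
# A sketch using collections.Counter; reuses lemmatize_nouns from above.
from collections import Counter

def count_lemmatized_nouns(text, n=20):
    """Return the n most frequent lemmatized nouns in the text as (lemma, count) tuples."""
    counts = Counter(lemmatize_nouns(text))
    return counts.most_common(n)

print(count_lemmatized_nouns(test_chapter))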
5.b) Write a function called lemmas_to_file. It should take a list of words as input and write them to a file. Each noun should appear on a new line. The name of the file should be a parameter. Test your function using test_text or test_chapter and call the files test_text_most_frequent_nouns.txt and test_chapter_most_frequent_nouns.txt in your working directory.
Hint 1: If you use a context manager, you can loop through a list of lemmas within the manager.
Hint 2: To write new lines, remember to simply add the newline character to the string.
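A possible sketch; the list comprehension at the end assumes count_lemmatized_nouns returns (lemma, count) tuples, as in the sketch above.
In [ ]:
# A sketch; writes one lemma per line using a context manager (Hint 1) and '\n' (Hint 2).
def lemmas_to_file(lemmas, filename):
    """Write the given lemmas to a file, one per line."""
    with open(filename, 'w', encoding='utf-8') as outfile:
        for lemma in lemmas:
            outfile.write(lemma + '\n')

# keep only the lemmas if count_lemmatized_nouns returns (lemma, count) tuples
top_nouns = [lemma for lemma, count in count_lemmatized_nouns(test_chapter)]
lemmas_to_file(top_nouns, 'test_chapter_most_frequent_nouns.txt')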
5.c) Time to combine everything! Write a function called get_nouns_books which extracts the most frequent lemmas from the three books in our ../Data/books/ directory and writes them to files given the filepaths. Make sure the filenames of the new files follow this convention: [name]-topnouns20.txt (e.g. HuckFinn-topnouns20.txt).
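One possible way to combine the sketches above (the list of filepaths is an assumption based on the download cell at the top):
In [ ]:
# A sketch combining get_book, count_lemmatized_nouns and lemmas_to_file from above.
import os

def get_nouns_books(filepaths, n=20):
    """Extract the n most frequent noun lemmas from each book and write them to a file."""
    for filepath in filepaths:
        name = os.path.basename(filepath).replace('.txt', '')
        top_nouns = [lemma for lemma, count in count_lemmatized_nouns(get_book(filepath), n)]
        lemmas_to_file(top_nouns, name + '-topnouns' + str(n) + '.txt')

book_paths = ['../Data/books/HuckFinn.txt', '../Data/books/Macbeth.txt', '../Data/books/AnnaKarenina.txt']
get_nouns_books(book_paths)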
5.d) Now move all your functions to a python script called analyze_books.py and call it from the terminal. Don't forget to import all the modules you are using and to define a list of all the filepaths. Don't worry if it runs for a while - it has to process a lot of text! If your computer can't handle it, it is fine to remove a book from the list (Anna Karenina is by far the longest...).
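The overall structure of the script could look roughly like this (a skeleton only; paste your own functions where indicated, then run python analyze_books.py from the terminal):
# analyze_books.py - a possible skeleton; the analysis functions themselves are not repeated here
import os
from collections import Counter
import nltk
from nltk.stem import WordNetLemmatizer

# ... your functions go here: get_book, lemmatize_nouns,
#     count_lemmatized_nouns, lemmas_to_file, get_nouns_books ...

if __name__ == '__main__':
    # this block only runs when the script is executed directly, e.g. with: python analyze_books.py
    book_paths = ['../Data/books/HuckFinn.txt',
                  '../Data/books/Macbeth.txt',
                  '../Data/books/AnnaKarenina.txt']
    get_nouns_books(book_paths)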
Congratulations! You have written your first little NLP program!
In [ ]:
# feel free to try out code here