Assignment 3b: Writing your own nlp program

Due: Friday the 23rd of November 2017, 23:59

  • Please name your ipython notebook with the following naming convention: ASSIGNMENT_3b_FIRSTNAME_LASTNAME.ipynb*

  • Please submit your assignment (notebook + additional files) together with part 1 as a single .zip file using this google form*

  • If you have questions about this topic, please refer to the forum on the Canvas site.

In this part of the assignment, we will carry out our own little text analysis project. The goal is to gain some insights into longer texts without having to read them all in detail.

This part of the assignment builds on some notions that have been revised in part a of the assignment. Please feel free to go back to part a and reuse your code whenever possible.

The goals of this part are:

  • dividing a problem into smaller sub-problems and testing code on small examples
  • performing text analysis and writing results to a file
  • combining small functions into bigger functions

Tip: The assignment is split into 5 steps, each of which is divided into smaller sub-steps. Instead of working strictly step by step, we highly recommend you read all sub-steps of a step first and then start coding. In many cases, the sub-steps are there to help you split the problem into manageable sub-problems, but it is still good to keep the overall goal in mind.

Step 1: Data collection

In the directory ../Data/books/, you should find three .txt files. If not, you can use the following cell to download them. Also, feel free to look at the code to learn how to download .txt files from the web.

We define a function called download_book which downloads a book in .txt format, as well as a dictionary with names and urls. We then loop through the dictionary, download each book and write it to a file in the ../Data/books/ directory. You don't need to do anything - just run the cell and the files will be downloaded to your computer.


In [ ]:
# Downloading data - you get this for free :-) 

import requests
import os


def download_book(url):
    """
    Download book given a url to a book in .txt format and return it as a string.
    """
    text_request = requests.get(url)
    text = text_request.text
    return text


book_urls = dict()
book_urls['HuckFinn'] = 'http://www.gutenberg.org/cache/epub/7100/pg7100.txt'
book_urls['Macbeth'] = 'http://www.gutenberg.org/cache/epub/1533/pg1533.txt'
book_urls['AnnaKarenina'] = 'http://www.gutenberg.org/files/1399/1399-0.txt'


# os.makedirs also creates the parent directory ../Data/ if it does not exist yet
if not os.path.isdir('../Data/books/'):
    os.makedirs('../Data/books/')


for name, url in book_urls.items():
    text = download_book(url)
    # write with an explicit encoding so the Gutenberg texts are saved correctly
    # regardless of your system's default encoding
    with open('../Data/books/' + name + '.txt', 'w', encoding='utf-8') as outfile:
        outfile.write(text)

Exercise 1

Was the download successful? Use a specific python module to print a list of all the files in the directory ../Data/books.


In [ ]:
# your code here
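
If you are unsure where to start, a minimal sketch using the os module could look like the cell below; glob would work just as well.


In [ ]:
# a minimal sketch using the os module; glob.glob('../Data/books/*.txt') would also work
import os

for filename in os.listdir('../Data/books/'):
    print(filename)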

Exercise 2

For testing code, it is always good to use a little toy example. Since we don't want to test our code on all three books, let's use a single chapter of the Huck Finn novel. Let's say we want to look at Chapter 1 for now.

To do this, we need to:

  • read the correct file
  • look at the txt file and see how the chapters are separated from each other
  • assign Chapter 1 to a variable so we can reuse it

Solve this first task by writing a function called get_chapter_HuckFinn which returns the chapter. Define the function in a way that the user can specify which chapter should be returned. Set the default to Chapter 1. Use as many helper functions as you need and give them indicative names.

Please assign Chapter 1 of HuckFinn to a variable called test_chapter.

Hint: Please don't worry too much about splitting the chapters - as long as you make sure your solution works for most of the cases, it is alright for this exercise. However, it is always good to have an idea of what may go wrong and whether there is a way to avoid it.


In [ ]:
# your code here
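
If you get stuck, here is a rough sketch of one possible approach. The split string 'CHAPTER ' is an assumption - inspect the file to see how the chapters are really separated - and a cleaner solution would factor the steps into helper functions with indicative names, as asked above.


In [ ]:
# a rough sketch - it assumes chapter headings start with 'CHAPTER ',
# which you should verify by inspecting the .txt file yourself

def get_chapter_HuckFinn(chapter_number=1):
    """Return one chapter of Huckleberry Finn as a string (Chapter 1 by default)."""
    with open('../Data/books/HuckFinn.txt', encoding='utf-8') as infile:
        book = infile.read()
    # everything before the first heading (e.g. the Gutenberg header) ends up in
    # the first element - one of the things that may go wrong with a naive split
    chapters = book.split('CHAPTER ')
    return chapters[chapter_number]

test_chapter = get_chapter_HuckFinn()
print(test_chapter[:300])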

Exercise 3

3.a) Let's get a little bit of an overview of what we can find in a chapter. Write a function taking a string as input and printing:

  • The number of sentences
  • The number of tokens
  • The size of the vocabulary used (i.e. unique tokens)

Test your function using Chapter 1 of HuckFinn as a test string.

3.b) Write a modified version of the function you used to load a single chapter so that it returns all chapters. Attention: The part before the first chapter should not be included. Then iterate over the chapters and print the basic statistics collected in 3.a for each chapter. Don't forget to print the chapter number as well.

Hint: If you did not manage 3.a, try to still iterate over the list of chapters and print the chapter number.


In [ ]:
# your code here
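
One possible sketch is shown below, assuming you worked with nltk in part a of the assignment (reuse whatever tokenizer you used there instead if it was something else). The function name print_text_stats is just a suggestion - the exercise does not prescribe one.


In [ ]:
# a minimal sketch, assuming nltk was used in part a of the assignment
from nltk.tokenize import sent_tokenize, word_tokenize


def print_text_stats(text):
    """Print the number of sentences, tokens and unique tokens in a text."""
    sentences = sent_tokenize(text)
    tokens = word_tokenize(text)
    print('Number of sentences:', len(sentences))
    print('Number of tokens:', len(tokens))
    print('Vocabulary size:', len(set(tokens)))


# 3.a: test on the chapter defined above
print_text_stats(test_chapter)


# 3.b: a loader that returns all chapters, dropping the part before Chapter 1
def get_chapters_HuckFinn():
    with open('../Data/books/HuckFinn.txt', encoding='utf-8') as infile:
        book = infile.read()
    return book.split('CHAPTER ')[1:]


for number, chapter in enumerate(get_chapters_HuckFinn(), start=1):
    print('Chapter', number)
    print_text_stats(chapter)
    print()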

In [ ]:

Exercise 4

Time to scale up: Now let's compare the three books to each other.

Preparation: Write a function called get_book which takes a filepath as input and returns the entire book as text (you have already written this code above, just turn it into an individual function in case you haven't done so already).

4.a) Modify the function from 3.b that loads the book as a list of chapters or acts (call it get_chapters) so that:

  • it takes the book as a string as input
  • it splits the book into chapters or acts and returns them as a list (look at how chapters or acts are separated from each other in the other books and use conditions to split each book correctly)

Hint 1: You can use the title of the book as a parameter.

4.b) Write a function called get_stats_books which takes a list of filepaths to the three books as input (we have already created this list above) and prints:

  • The number of chapters/acts
  • The total number of sentences in the book
  • The total number of tokens in the book
  • The size of the vocabulary used in the book
  • Don't forget to print the name of each book and leave an empty line in between books

Hint 1: You can get the book title from the filepath by using the os module and/or applying some string manipulations to the filepath.

Hint 2: Use the function you already defined above for the statistics (except for the number of chapters).


In [ ]:
# your code here
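
A possible sketch is shown below. It reuses print_text_stats from the Exercise 3 sketch and assumes that Macbeth is divided into acts marked 'ACT ' while the novels use 'CHAPTER ' headings - inspect the files and adjust the split strings if the books mark their parts differently.


In [ ]:
# a sketch - the split strings are assumptions, check them against the actual files
import os


def get_book(filepath):
    """Read a book from a .txt file and return it as a single string."""
    with open(filepath, encoding='utf-8') as infile:
        return infile.read()


def get_chapters(book, title):
    """Split a book string into a list of chapters or acts, depending on the title."""
    if title == 'Macbeth':
        parts = book.split('ACT ')
    else:
        parts = book.split('CHAPTER ')
    # drop everything before the first chapter or act
    return parts[1:]


def get_stats_books(filepaths):
    """Print basic statistics for each book in the list of filepaths."""
    for filepath in filepaths:
        # e.g. '../Data/books/HuckFinn.txt' -> 'HuckFinn'
        title = os.path.splitext(os.path.basename(filepath))[0]
        book = get_book(filepath)
        chapters = get_chapters(book, title)
        print(title)
        print('Number of chapters/acts:', len(chapters))
        print_text_stats(book)
        print()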

Exercise 5

The statistics above already provide some insights, but we want to know a bit more about what the books are about. To do this, we want to get the 20 most frequent nouns of each book. Since some of us are linguists, we are not satisfied with just looking at the surface forms of the tokens. Instead, we want to lemmatize them first. In addition, we want to write our results to files, so we can analyze them later. Since we're getting really good at programming, we are going to move our functions to a python script which we can run from the terminal. Don't worry, we will do everything step by step.

Preparation:

Write a function called lemmatize_nouns which takes text as input and returns a list of lemmas. Hint: We have already written a function called tag_text in part a of the assignment. Feel free to use the code to get the lemmas!
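
A minimal sketch is given below, assuming tag_text in part a was built on nltk; if you used a different tagger or lemmatizer there, reuse that instead.


In [ ]:
# a sketch assuming nltk; reuse your tag_text code from part a if it differs
from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer


def lemmatize_nouns(text):
    """Return a list of lemmas for all nouns in the text."""
    lemmatizer = WordNetLemmatizer()
    lemmas = []
    for token, tag in pos_tag(word_tokenize(text)):
        if tag.startswith('NN'):  # NN, NNS, NNP, NNPS
            lemmas.append(lemmatizer.lemmatize(token.lower(), pos='n'))
    return lemmas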

5.a.)

Write a new function called count_lemmatized_nouns which takes text as input, counts the lemmas, and returns a list of the most frequent ones. Add a parameter so the user can specify how many should be returned. By default, it should return 20.

  • Hint 1: Feel free to modify the lemmatize_nouns function.

  • Hint 2: Feel free to use a Counter dictionary (in the collections module - google how to import and use it). You can also build your own dictionary for counting as we did in the chapter on dictionaries.

  • Hint 3: Getting the most frequent nouns is really easy with the Counter dictionary in the collections module. If you are not using it, have a look at the sorted function and how you can use it on tuples.

Test your function using test_text and test_chapter defined above.
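
One possible sketch, using the Counter dictionary mentioned in Hint 2 (note that most_common returns (lemma, count) pairs; if you only want the lemmas themselves, extract them afterwards):


In [ ]:
# a sketch using collections.Counter (Hint 2)
from collections import Counter


def count_lemmatized_nouns(text, number=20):
    """Return the `number` most frequent noun lemmas as (lemma, count) pairs."""
    lemma_counts = Counter(lemmatize_nouns(text))
    return lemma_counts.most_common(number)


print(count_lemmatized_nouns(test_chapter))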

5.b.)

Write a function called lemmas_to_file. It should take a list of words as input and write them to a file. Each noun should appear on a new line. The name of the file should be a parameter. Test your function using test_text or test_chapter and call the files test_text_most_frequent_nouns.txt and test_chapter_most_frequent_nouns.txt in your working directory.

Hint 1: If you use a context manager, you can loop through a list of lemmas within the manager.

Hint 2: To write new lines, remember to simply add the newline character (\n) to the string.
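
A minimal sketch follows; if you pass in the output of count_lemmatized_nouns, remember that it contains (lemma, count) pairs, so extract the lemmas first.


In [ ]:
# a minimal sketch: one word per line, written inside a context manager (Hint 1)
def lemmas_to_file(lemmas, filename):
    """Write a list of words to a file, one word per line."""
    with open(filename, 'w', encoding='utf-8') as outfile:
        for lemma in lemmas:
            outfile.write(lemma + '\n')


lemmas_to_file([lemma for lemma, count in count_lemmatized_nouns(test_chapter)],
               'test_chapter_most_frequent_nouns.txt')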

5.c)

Time to combine everything! Write a function called get_nouns_books which extracts the most frequent lemmas from the three books in our ../Data/books/ directory and writes them to files, given the filepaths. Make sure the filenames of the new files follow this convention: [name]-topnouns20.txt (e.g. HuckFinn-topnouns20.txt).
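
A sketch that ties the previous pieces together is shown below; it assumes the get_book, count_lemmatized_nouns and lemmas_to_file functions from the sketches above.


In [ ]:
# a sketch combining the functions from the previous sketches
import os


def get_nouns_books(filepaths, number=20):
    """Write the most frequent noun lemmas of each book to [name]-topnouns<number>.txt."""
    for filepath in filepaths:
        name = os.path.splitext(os.path.basename(filepath))[0]
        top_nouns = count_lemmatized_nouns(get_book(filepath), number)
        lemmas = [lemma for lemma, count in top_nouns]
        lemmas_to_file(lemmas, name + '-topnouns' + str(number) + '.txt')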

5.d)

Now move all your functions to a python script called analyze_books.py and call it from the terminal. Don't forget to import all the modules you are using and to define a list of all the filepaths. Don't worry if it runs for a while - it has to process a lot of text! If your computer can't handle it, it is fine if you remove a book from the list (Anna Karenina is by far the longest...).
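
The overall structure of analyze_books.py might look roughly like this (not a notebook cell); paste your functions where indicated and run it with python analyze_books.py:

# analyze_books.py - a possible skeleton for the script
import glob

# ... paste your other imports and the functions defined above here ...

if __name__ == '__main__':
    # adjust the path if the script does not live next to the notebook
    filepaths = glob.glob('../Data/books/*.txt')
    get_nouns_books(filepaths)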

Congratulations! You have written your first little nlp program!


In [ ]:
# feel free to try out code here