Assignment 4a: Reading and writing CSV/TSV and JSON data

Due: Tuesday the 4th of December 2018 at 20:00

  • Please name your notebook with the following naming convention: ASSIGNMENT_4a_FIRSTNAME_LASTNAME.ipynb
  • Please submit your complete assignment (4a + 4b) by compressing all your material (notebooks + python files + additional files) into a single .zip file following this naming convention: ASSIGNMENT_4_FIRSTNAME_LASTNAME.zip.
    Use this Google form for submission.
  • If you have questions about this assignment, please refer to the forum on the Canvas site.

In this block, we covered the following chapters about data formats:

  • Chapter 16 - Data Formats I (CSV/TSV)
  • Chapter 17 - Data Formats II (JSON)
  • Chapter 18 - Data Formats III (XML)

In this first part of Assignment 4, you will be asked to read and write CSV/TSV and JSON data.

Exercise 1: Trump's Facebook Status Updates (CSV/TSV)

In the folder ../Data/csv_data there is a TSV file called trump_facebook.tsv that contains Facebook status updates posted by Donald Trump. It was downloaded from here. Follow the instructions below to read the file and find specific status updates.

1a. Write your own function for reading CSV

Write a function called read_csv() that takes two input parameters: input_file (positional parameter) and delimiter (keyword parameter with default string ","). The function should read the file and return status_updates, which contains the content of the file as a 'list of dicts'. When tested on ../Data/csv_data/trump_facebook.tsv, the first two status updates should thus be represented as follows:

[{'link_name': 'Timeline Photos',
  'num_angrys': '7',
  'num_comments': '543',
  'num_hahas': '17',
  'num_likes': '6178',
  'num_loves': '572',
  'num_reactions': '6813',
  'num_sads': '0',
  'num_shares': '359',
  'num_wows': '39',
  'status_id': '153080620724_10157915294545725',
  'status_link': 'https://www.facebook.com/DonaldTrump/photos/a.488852220724.393301.153080620724/10157915294545725/?type=3',
  'status_message': 'Beautiful evening in Wisconsin- THANK YOU for your incredible support tonight! Everyone get out on November 8th - and VOTE! LETS MAKE AMERICA GREAT AGAIN! -DJT',
  'status_published': '10/17/2016 20:56:51',
  'status_type': 'photo'},
 {'link_name': '',
  'num_angrys': '5211',
  'num_comments': '3644',
  'num_hahas': '75',
  'num_likes': '26649',
  'num_loves': '487',
  'num_reactions': '33768',
  'num_sads': '191',
  'num_shares': '17653',
  'num_wows': '1155',
  'status_id': '153080620724_10157914483265725',
  'status_link': 'https://www.facebook.com/DonaldTrump/videos/10157914483265725/',
  'status_message': "The State Department's quid pro quo scheme proves how CORRUPT our system is. Attempting to protect Crooked Hillary, NOT our American service members or national security information, is absolutely DISGRACEFUL. The American people deserve so much better. On November 8th, we will END this RIGGED system once and for all!",
  'status_published': '10/17/2016 18:00:41',
  'status_type': 'video'}]

DO NOT USE THE CSV MODULE FOR THIS EXERCISE!


In [ ]:
def read_csv(input_file, delimiter=","):
    # your code here


# test your function here
filename = "../Data/csv_data/trump_facebook.tsv"
status_updates = read_csv(filename, delimiter="\t") 
status_updates[0:2]

In case you didn't manage to create the read_csv() function, run the following code using the DictReader() method from the csv module to get the data in the right format for the following exercises:


In [ ]:
import csv

filename = "../Data/csv_data/trump_facebook.tsv"
with open(filename, "r") as infile:
    status_updates = []
    csv_reader = csv.DictReader(infile, delimiter='\t')
    for row in csv_reader:
        status_updates.append(row)
status_updates[0:2]

1b. Find the status updates with the most responses

Define a function called get_update_most_responded_to() that takes two input parameters: status_updates (positional parameter) and response_type (keyword parameter with default string "likes"). The function should find the status update that received the highest number of 'angrys', 'comments', 'hahas', etc. It should return three strings: the status_message, the status_type and the status_link of this particular status update.


In [ ]:
def get_update_most_responded_to(status_updates, response_type="likes"):
    # your code here

# test your function here

1c. Find the longest status updates

Define a function called get_longest_update() that takes two input parameters: status_updates (positional parameter) and length_type (keyword parameter with default string "words"). The function should find the status update that is the longest in terms of characters, words or sentences in the message. It should return one string: the status_message of this particular status update.


In [ ]:
def get_longest_update(status_updates, length_type="words"):
    # your code here

# test your function here
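
A possible sketch is shown below. The sentence count is a naive split on '.', '!' and '?'; a proper sentence tokenizer (for example NLTK's sent_tokenize) would be more robust:

```python
import re

# Sketch only: length_type selects between character, word and sentence counts.
def get_longest_update(status_updates, length_type="words"):
    """Return the status_message that is longest in characters, words or sentences."""
    def length(update):
        message = update["status_message"]
        if length_type == "chars":
            return len(message)
        if length_type == "words":
            return len(message.split())
        # length_type == "sentences": naive split on sentence-final punctuation
        return len([s for s in re.split(r"[.!?]+", message) if s.strip()])
    return max(status_updates, key=length)["status_message"]
```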

1d. Find the status updates containing specific keywords

Define a function called get_updates_with_keywords() that takes three input parameters: status_updates (positional parameter), keywords (positional parameter) and case_sensitive (keyword parameter with default False). The function should find the status updates that contain any of the keywords. The parameter case_sensitive should specify whether uppercase and lowercase characters must be treated as distinct. The function should return filtered_status_updates, which is a list of dicts with all information about the status updates (same format as input parameter). Make sure that you tokenize the messages.


In [ ]:
def get_updates_with_keywords(status_updates, keywords, case_sensitive=False):
    # your code here

keywords = ["clinton", "obama"] # test with these keywords; also experiment with other keywords
# test your function here
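
One possible sketch follows. Tokenization here is a plain str.split(); the course may expect a proper tokenizer such as NLTK's word_tokenize instead:

```python
# Sketch only: tokenization is a plain whitespace split, so punctuation
# attached to a word (e.g. "Clinton,") will not match the bare keyword.
def get_updates_with_keywords(status_updates, keywords, case_sensitive=False):
    """Return all status updates whose message contains any of the keywords."""
    filtered_status_updates = []
    for update in status_updates:
        tokens = update["status_message"].split()
        if case_sensitive:
            targets = keywords
        else:
            tokens = [token.lower() for token in tokens]
            targets = [keyword.lower() for keyword in keywords]
        if any(keyword in tokens for keyword in targets):
            filtered_status_updates.append(update)
    return filtered_status_updates
```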

Exercise 2: Nobel Prize Winners (JSON)

There is a lot of interesting data online. For example, the Nobel Prize Organisation provides the Nobel Prize API that allows you to download information about the prizes, the laureates and the countries.

The information is formatted in JSON. Have a look at the following URLs:

  • http://api.nobelprize.org/v1/prize.json
  • http://api.nobelprize.org/v1/laureate.json

For this exercise, we will only look at the prizes and the laureates.

We can download the data using the requests module. How this works is shown below.


In [ ]:
import requests

In [ ]:
# Download data on prizes
api_url = "http://api.nobelprize.org/v1/prize.json"
r = requests.get(api_url)
dict_prizes = r.json()
# uncomment the line below if you'd like to see what's inside dict_prizes
#dict_prizes

In [ ]:
# Download data on laureates
api_url = "http://api.nobelprize.org/v1/laureate.json"
r = requests.get(api_url)
dict_laureates = r.json()
# uncomment the line below if you'd like to see what's inside dict_laureates
#dict_laureates

2a. Read the JSON files

We have already stored the data as the JSON files laureate.json and prize.json in the folder ../Data/json_data/NobelPrize. Open these JSON files and load them as the Python dictionaries dict_laureates and dict_prizes.


In [ ]:

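A minimal sketch of what this cell could contain is shown below; json.load() reads one JSON document from an open file. The file paths are the ones given in the instructions above:

```python
import json

def load_json(path):
    """Open a JSON file and return its content as a Python object."""
    with open(path, "r") as infile:
        return json.load(infile)

# File locations taken from the instructions above:
# dict_laureates = load_json("../Data/json_data/NobelPrize/laureate.json")
# dict_prizes = load_json("../Data/json_data/NobelPrize/prize.json")
```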
2b. Get all laureates from year and category

Create a function called get_laureates() that takes three input parameters: dict_prizes (positional parameter), year (keyword parameter with default None) and category (keyword parameter with default None). The function should find all laureates that received the Nobel Prize, optionally in a specific year and/or category. It should return a list of the full names of the laureates. For example, for the year 2018 and category "peace" it should return the list ['Denis Mukwege', 'Nadia Murad'].


In [ ]:
def get_laureates(dict_prizes, year=None, category=None):
    # your code here                     


year = 2018
category = "peace"
# test your function here
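
One possible sketch is below. It assumes the v1 API layout, in which prize.json holds a top-level "prizes" list whose entries carry "year", "category" and a "laureates" list with "firstname"/"surname" fields; organizations such as ICAN may lack a surname, and some prizes were not awarded and have no "laureates" key at all:

```python
# Sketch only: assumes the v1 prize.json layout described above.
def get_laureates(dict_prizes, year=None, category=None):
    """Return the full names of laureates, optionally filtered by year and/or category."""
    names = []
    for prize in dict_prizes["prizes"]:
        if year is not None and prize.get("year") != str(year):
            continue
        if category is not None and prize.get("category") != category:
            continue
        for laureate in prize.get("laureates", []):  # some prizes were not awarded
            full_name = laureate.get("firstname", "")
            if "surname" in laureate:  # organizations often have no surname
                full_name += " " + laureate["surname"]
            names.append(full_name)
    return names
```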

2c. Get all prizes from affiliations

Create a function called get_affiliation_prizes() that takes one input parameter: dict_laureates (positional parameter). The function should find all affiliates that were involved in winning the Nobel Prize and provide information on the category and year of those Nobel Prizes. It should return a nested dictionary of the following format:

{
    "A.F. Ioffe Physico-Technical Institute": [
        {"category": "physics", "year": "2000"}
    ],
    "Aarhus University": [
        {"category": "chemistry", "year": "1997"},
        {"category": "economics", "year": "2010"}
    ]
}

Tip: some of the entries will miss information (for example, there is no associated affiliation). Use if-statements to check if essential information is present.


In [ ]:
def get_affiliation_prizes(dict_laureates):
    # your code here

# test your function here
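
A possible sketch is below. It assumes the v1 laureate.json layout, in which each entry in the top-level "laureates" list has a "prizes" list whose items may carry an "affiliations" list; as the tip above notes, missing affiliations can appear as empty lists or as entries without a "name", so the code checks before using them:

```python
from collections import defaultdict

# Sketch only: assumes the v1 laureate.json layout described above.
def get_affiliation_prizes(dict_laureates):
    """Map each affiliation name to the (category, year) of prizes won there."""
    affiliations = defaultdict(list)
    for laureate in dict_laureates["laureates"]:
        for prize in laureate.get("prizes", []):
            for affiliation in prize.get("affiliations", []):
                # missing affiliations can show up as empty lists or nameless dicts
                if isinstance(affiliation, dict) and "name" in affiliation:
                    affiliations[affiliation["name"]].append(
                        {"category": prize["category"], "year": prize["year"]})
    return dict(affiliations)
```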

2d. Write to JSON

Next, write the dictionary created in the previous exercise as JSON to the file ../Data/json_data/NobelPrize/nobel_prizes_affiliations.json.


In [ ]:
# write the resulting dictionary to 'json_file'
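A minimal sketch of writing a dictionary to disk as JSON; json.dump() serializes a Python object to an open file, and indent=4 makes the output human-readable. The target path is the one given in the instructions above:

```python
import json

def write_json(dictionary, json_file):
    """Write a dictionary to disk as indented JSON."""
    with open(json_file, "w") as outfile:
        json.dump(dictionary, outfile, indent=4)

# Target path taken from the instructions above:
# write_json(affiliation_prizes, "../Data/json_data/NobelPrize/nobel_prizes_affiliations.json")
```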