In this block, we covered the following chapters about data formats:
In this first part of Assignment 4, you will be asked to read and write CSV/TSV and JSON data.
In the folder ../Data/csv_data there is a TSV file called trump_facebook.tsv that contains Facebook status updates posted by Donald Trump. It was downloaded from here. Follow the instructions below to read the file and find specific status updates.
Write a function called read_csv() that takes two input parameters: input_file (positional parameter) and delimiter (keyword parameter with default string ","). The function should read the file and return status_updates, which contains the content of the file as a list of dicts. When tested on ../Data/csv_data/trump_facebook.tsv, the first two status updates should be represented as follows:
[{'link_name': 'Timeline Photos',
'num_angrys': '7',
'num_comments': '543',
'num_hahas': '17',
'num_likes': '6178',
'num_loves': '572',
'num_reactions': '6813',
'num_sads': '0',
'num_shares': '359',
'num_wows': '39',
'status_id': '153080620724_10157915294545725',
'status_link': 'https://www.facebook.com/DonaldTrump/photos/a.488852220724.393301.153080620724/10157915294545725/?type=3',
'status_message': 'Beautiful evening in Wisconsin- THANK YOU for your incredible support tonight! Everyone get out on November 8th - and VOTE! LETS MAKE AMERICA GREAT AGAIN! -DJT',
'status_published': '10/17/2016 20:56:51',
'status_type': 'photo'},
{'link_name': '',
'num_angrys': '5211',
'num_comments': '3644',
'num_hahas': '75',
'num_likes': '26649',
'num_loves': '487',
'num_reactions': '33768',
'num_sads': '191',
'num_shares': '17653',
'num_wows': '1155',
'status_id': '153080620724_10157914483265725',
'status_link': 'https://www.facebook.com/DonaldTrump/videos/10157914483265725/',
'status_message': "The State Department's quid pro quo scheme proves how CORRUPT our system is. Attempting to protect Crooked Hillary, NOT our American service members or national security information, is absolutely DISGRACEFUL. The American people deserve so much better. On November 8th, we will END this RIGGED system once and for all!",
'status_published': '10/17/2016 18:00:41',
'status_type': 'video'}]
DO NOT USE THE CSV MODULE FOR THIS EXERCISE!
In [ ]:
def read_csv(input_file, delimiter=","):
    # your code here
    pass

# test your function here
filename = "../Data/csv_data/trump_facebook.tsv"
status_updates = read_csv(filename, delimiter="\t")
status_updates[0:2]
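As a hint, the sketch below shows one possible way to turn delimited lines into a list of dicts without the csv module. It assumes that fields never contain the delimiter or line breaks, which may not hold for every file. The helper name parse_delimited and the miniature sample are made up for illustration and do not come from the real data.

```python
def parse_delimited(lines, delimiter=","):
    """Turn an iterable of delimited text lines into a list of dicts."""
    rows = []
    header = None
    for line in lines:
        line = line.rstrip("\r\n")  # drop the line ending, keep empty trailing fields
        if not line:
            continue  # skip blank lines
        fields = line.split(delimiter)
        if header is None:
            header = fields  # the first non-empty line holds the column names
            continue
        rows.append(dict(zip(header, fields)))
    return rows


# Hypothetical miniature TSV (assumption: not taken from the real file)
sample = "status_id\tstatus_type\tnum_likes\n1\tphoto\t10\n2\tvideo\t25\n"
parsed = parse_delimited(sample.splitlines(), delimiter="\t")
print(parsed)
```

An open file object can be passed in the same way as the list of lines, because iterating over a file yields one line at a time. Note that all values stay strings, just as in the expected output above.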
In case you didn't manage to create the read_csv() function, run the following code, which uses the DictReader class from the csv module, to get the data in the right format for the following exercises:
In [ ]:
import csv

filename = "../Data/csv_data/trump_facebook.tsv"
with open(filename, "r") as infile:
    status_updates = []
    csv_reader = csv.DictReader(infile, delimiter='\t')
    for row in csv_reader:
        status_updates.append(row)
status_updates[0:2]
Define a function called get_update_most_responded_to() that takes two input parameters: status_updates (positional parameter) and response_type (keyword parameter with default string "likes"). The function should find the status update that received the highest number of 'angrys', 'comments', 'hahas', etc. It should return three strings: the status_message, the status_type and the status_link of this particular status update.
In [ ]:
def get_update_most_responded_to():
    # your code here
    pass

# test your function here
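One thing to watch out for: the counts in status_updates are strings, and strings compare alphabetically, so "9" is considered larger than "10". The sketch below shows one possible way around this, using max() with a key function that converts to int. The helper name most_responded and the two sample updates are invented for illustration; only the key names follow the data shown earlier.

```python
def most_responded(updates, response_type="likes"):
    """Return message, type and link of the update with the highest count."""
    key = "num_" + response_type  # e.g. "likes" -> "num_likes"
    best = max(updates, key=lambda update: int(update[key]))
    return best["status_message"], best["status_type"], best["status_link"]


# Hypothetical sample updates (assumption: values are made up)
sample_updates = [
    {"status_message": "first", "status_type": "photo",
     "status_link": "link-1", "num_likes": "9", "num_angrys": "120"},
    {"status_message": "second", "status_type": "video",
     "status_link": "link-2", "num_likes": "10", "num_angrys": "7"},
]
top_liked = most_responded(sample_updates)
top_angry = most_responded(sample_updates, response_type="angrys")
print(top_liked)
print(top_angry)
```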
Define a function called get_longest_update() that takes two input parameters: status_updates (positional parameter) and length_type (keyword parameter with default string "words"). The function should find the status update that is the longest in terms of the characters, words or sentences in the message. It should return one string: the status_message of this particular status update.
In [ ]:
def get_longest_update():
    # your code here
    pass

# test your function here
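The three length types can each be measured with a few lines of code. The sketch below is one rough way to do it, assuming that whitespace separates words and that ".", "!" and "?" end sentences. A proper tokenizer would be more accurate. The helper name message_length and the sample messages are made up for illustration.

```python
import re


def message_length(message, length_type="words"):
    """Measure a message in characters, words or sentences (roughly)."""
    if length_type == "characters":
        return len(message)
    if length_type == "words":
        return len(message.split())  # whitespace-separated chunks
    if length_type == "sentences":
        # crude split on runs of sentence-ending punctuation
        parts = re.split(r"[.!?]+", message)
        return len([part for part in parts if part.strip()])
    raise ValueError("unknown length_type: " + length_type)


# Hypothetical sample messages (assumption: not taken from the data)
messages = ["Thank you! Vote today.", "Vote."]
longest = max(messages, key=lambda m: message_length(m, "words"))
print(longest)
```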
Define a function called get_updates_with_keywords() that takes three input parameters: status_updates (positional parameter), keywords (positional parameter) and case_sensitive (keyword parameter with default False). The function should find the status updates that contain any of the keywords. The parameter case_sensitive should specify whether uppercase and lowercase characters must be treated as distinct. The function should return filtered_status_updates, which is a list of dicts with all information about the status updates (same format as the input parameter). Make sure that you tokenize the messages.
In [ ]:
def get_updates_with_keywords():
    # your code here
    pass

keywords = ["clinton", "obama"]  # test with these keywords; also experiment with other keywords
# test your function here
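Tokenizing matters here because a plain substring test would also match a keyword inside a longer word. The sketch below illustrates the idea for a single message, using a simple regular expression as a stand-in tokenizer (an assumption; any tokenizer you prefer would do). The helper name contains_keyword is hypothetical.

```python
import re


def contains_keyword(message, keywords, case_sensitive=False):
    """Check whether any keyword occurs as a whole token in the message."""
    tokens = re.findall(r"\w+", message)  # simple stand-in tokenizer
    if not case_sensitive:
        tokens = [token.lower() for token in tokens]
        keywords = [keyword.lower() for keyword in keywords]
    # True when at least one keyword is among the tokens
    return not set(tokens).isdisjoint(keywords)


print(contains_keyword("Obama's plan failed", ["obama"]))
print(contains_keyword("Clintonville is a town", ["clinton"]))
print(contains_keyword("Obama spoke", ["obama"], case_sensitive=True))
```

The second call shows why tokens are compared instead of substrings: "clinton" is part of "Clintonville" but is not a token of its own.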
There is a lot of interesting data online. For example, the Nobel Prize Organisation provides the Nobel Prize API, which allows you to download information about the prizes, the laureates and the countries.
The information is formatted in JSON. Have a look at the following URLs:
For this exercise, we will only look at the prizes and the laureates.
We can download the data using the requests module. How this works is shown below.
In [ ]:
import requests
In [ ]:
# Download data on prizes
api_url = "http://api.nobelprize.org/v1/prize.json"
r = requests.get(api_url)
dict_prizes = r.json()
# uncomment the line below if you'd like to see what's inside dict_prizes
#dict_prizes
In [ ]:
# Download data on laureates
api_url = "http://api.nobelprize.org/v1/laureate.json"
r = requests.get(api_url)
dict_laureates = r.json()
# uncomment the line below if you'd like to see what's inside dict_laureates
#dict_laureates
Create a function called get_laureates() that takes three input parameters: dict_prizes (positional parameter), year (keyword parameter with default None) and category (keyword parameter with default None). The function should find all laureates that received the Nobel Prize, optionally in a specific year and/or category. It should return a list of the full names of the laureates. For example, for the year 2018 and category "peace" it should return the list ['Denis Mukwege', 'Nadia Murad'].
In [ ]:
def get_laureates():
    # your code here
    pass

year = 2018
category = "peace"
# test your function here
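The sketch below shows one possible way to filter on optional parameters: a filter is only applied when its value is not None. It runs on a small hand-made dictionary and assumes that the downloaded data has a top-level "prizes" list whose entries carry "year" and "category" as strings and an optional "laureates" list with "firstname" and "surname". Check this against the real dict_prizes before relying on it. The helper name laureate_names and the physics entry are made up.

```python
def laureate_names(dict_prizes, year=None, category=None):
    """Collect full names, optionally restricted to a year and/or category."""
    names = []
    for prize in dict_prizes.get("prizes", []):
        # years are assumed to be stored as strings, so compare as strings
        if year is not None and prize.get("year") != str(year):
            continue
        if category is not None and prize.get("category") != category:
            continue
        for laureate in prize.get("laureates", []):
            parts = [laureate.get("firstname", ""), laureate.get("surname", "")]
            names.append(" ".join(part for part in parts if part))
    return names


# Hand-made sample (assumption: structure mimics the API, content is partly invented)
sample_prizes = {"prizes": [
    {"year": "2018", "category": "peace", "laureates": [
        {"firstname": "Denis", "surname": "Mukwege"},
        {"firstname": "Nadia", "surname": "Murad"}]},
    {"year": "2017", "category": "physics", "laureates": [
        {"firstname": "Example", "surname": "Laureate"}]},
    {"year": "2016", "category": "peace"},  # entry without laureates
]}
print(laureate_names(sample_prizes, year=2018, category="peace"))
```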
Create a function called get_affiliation_prizes() that takes one input parameter: dict_laureates (positional parameter). The function should find all affiliations that were involved in winning a Nobel Prize and provide information on the category and year of those Nobel Prizes. It should return a nested dictionary of the following format:
{
    "A.F. Ioffe Physico-Technical Institute": [
        {"category": "physics", "year": "2000"}
    ],
    "Aarhus University": [
        {"category": "chemistry", "year": "1997"},
        {"category": "economics", "year": "2010"}
    ]
}
Tip: some of the entries are missing information (for example, there is no associated affiliation). Use if-statements to check whether essential information is present.
In [ ]:
def get_affiliation_prizes():
    # your code here
    pass

# test your function here
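The sketch below illustrates one way to group prizes by affiliation while skipping incomplete entries. It assumes that the data has a top-level "laureates" list, that each laureate has a "prizes" list, and that each prize has an "affiliations" list whose items are dicts with a "name" (or empty placeholders when no affiliation is known). Verify these assumptions against the real dict_laureates. The helper name collect_affiliation_prizes and the laureate names are made up; skipping duplicate entries is a design choice of this sketch, not a requirement stated above.

```python
def collect_affiliation_prizes(dict_laureates):
    """Map each affiliation name to a list of {category, year} dicts."""
    result = {}
    for laureate in dict_laureates.get("laureates", []):
        for prize in laureate.get("prizes", []):
            category = prize.get("category")
            year = prize.get("year")
            if not category or not year:
                continue  # essential information is missing
            for affiliation in prize.get("affiliations", []):
                # skip empty placeholders and entries without a name
                if not isinstance(affiliation, dict) or not affiliation.get("name"):
                    continue
                entry = {"category": category, "year": year}
                prizes_for_name = result.setdefault(affiliation["name"], [])
                if entry not in prizes_for_name:  # avoid duplicates for shared prizes
                    prizes_for_name.append(entry)
    return result


# Hand-made sample (assumption: structure mimics the API, laureate names invented)
sample_laureates = {"laureates": [
    {"firstname": "Laureate", "surname": "One", "prizes": [
        {"category": "physics", "year": "2000",
         "affiliations": [{"name": "A.F. Ioffe Physico-Technical Institute"}]}]},
    {"firstname": "Laureate", "surname": "Two", "prizes": [
        {"category": "chemistry", "year": "1997",
         "affiliations": [{"name": "Aarhus University"}]}]},
    {"firstname": "Laureate", "surname": "Three", "prizes": [
        {"category": "economics", "year": "2010",
         "affiliations": [{"name": "Aarhus University"}]}]},
    {"firstname": "Laureate", "surname": "Four", "prizes": [
        {"category": "peace", "year": "1999", "affiliations": [[]]}]},
]}
grouped = collect_affiliation_prizes(sample_laureates)
print(grouped)
```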
In [ ]:
# write the resulting dictionary to 'json_file'
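Writing a dictionary to a JSON file can be done with json.dump(). The sketch below is a minimal example that writes a small dictionary to a temporary location and reads it back to check the result. The file name affiliation_prizes.json and the temporary folder are assumptions for illustration; in the assignment you would use your own path.

```python
import json
import os
import tempfile

# Small example dictionary in the format described above
affiliations = {
    "Aarhus University": [
        {"category": "chemistry", "year": "1997"},
        {"category": "economics", "year": "2010"},
    ]
}

# Hypothetical output path inside a fresh temporary folder
json_path = os.path.join(tempfile.mkdtemp(), "affiliation_prizes.json")

with open(json_path, "w", encoding="utf-8") as outfile:
    # indent makes the file readable; ensure_ascii=False keeps accented names intact
    json.dump(affiliations, outfile, indent=4, ensure_ascii=False)

with open(json_path, encoding="utf-8") as infile:
    reloaded = json.load(infile)  # read the file back as a check

print(reloaded == affiliations)
```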