For this exercise, we need to a) open the file data/Dalziel2016_data.csv for reading; b) read all the lines; c) add the name of the city to a data structure, making sure that we have no repeated entries. For this reason, we're going to work with sets. We start by initializing an empty set called cities:
In [1]:
cities = set() # initialize an empty set
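To see why a set suits this task, here is a minimal sketch using a few made-up city names (they are not read from the data file, and the names sample and demo_cities are chosen only for this illustration). Adding a value that is already in the set leaves the set unchanged:

```python
# hypothetical city names, used only for illustration
sample = ['CHICAGO', 'NEW YORK', 'CHICAGO', 'BOSTON', 'NEW YORK']

demo_cities = set()  # an empty set
for name in sample:
    demo_cities.add(name)  # adding an element that is already present has no effect

print(len(sample), len(demo_cities))  # 5 3
print(sorted(demo_cities))  # sorted() returns a list, as sets have no order
```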
Now we open the file for reading. We use the with statement, which takes care of closing the file:
In [2]:
import csv # we use this module to handle csv files
with open('../data/Dalziel2016_data.csv', 'r') as f: # 'r' stands for reading
    my_csv = csv.DictReader(f) # set up the csv reader
    for line in my_csv: # loop over all lines
        print(line)
        break # break the loop after printing the first line to inspect results
In the code above, we have imported the module csv, which allows us to parse character-delimited files. In this case, we do not need to specify any special option, as we're reading a plain-vanilla csv file, delimited by commas.
Having opened the file, we create a DictReader object, which parses each line, creating a dictionary whose entries are the values for each of the columns (named as specified by the header of the csv file).
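To make the behavior of DictReader concrete without touching the data file, the sketch below parses a tiny in-memory csv. The column names mirror those used in this exercise, but the values are invented, and io.StringIO simply stands in for an open file:

```python
import csv
import io

# made-up rows; only the column names follow the exercise
text = "loc,year,pop\nCHICAGO,1906,100.0\nCHICAGO,1907,110.0\n"

with io.StringIO(text) as f:  # behaves like a file opened for reading
    rows = list(csv.DictReader(f))  # one dictionary per data line

# the header line supplies the keys; every value is read as a string
print(rows[0]['loc'], rows[0]['pop'])
```

Note that the values are strings, which is why a numeric column has to be converted (for example with float) before doing arithmetic on it.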
You can see that in the dictionary, the city is identified by the key 'loc'. We can therefore add the value line['loc'] to the set, completing the exercise:
In [3]:
import csv # we use the csv module, as we want to read a csv file
with open('../data/Dalziel2016_data.csv', 'r') as f: # 'r' stands for reading
    my_csv = csv.DictReader(f)
    for line in my_csv:
        cities.add(line['loc'])
Now all the cities are stored in the set cities, with all the duplicates automatically removed (as we're using a set):
In [4]:
cities
Out[4]:
This task requires a slightly different approach. We need to keep track of how many records are associated with each city. We can therefore create a dictionary citycount storing the city (key) and the associated number of records (value).
Because initially the dictionary is empty, every time we encounter a new city we need to add a key to the dictionary. The simplest way to do this is to use the dictionary method get, which returns the value stored for a key if the key is present, and a default value of our choice if it is not. This allows us to either update the value (if the key is already present), or to add a new key (if the key is not present). For example:
In [5]:
a = {} # an empty dictionary
a['my_new_key'] = a.get('my_new_key', 0) + 1
a
Out[5]:
The code above shows that when the key is not already present, the key will be added, and its value will initially be 1. If, on the other hand, the key is present, we will simply increment its associated value:
In [6]:
a['my_new_key'] = a.get('my_new_key', 0) + 1
a
Out[6]:
With this at hand, we can write our program:
In [7]:
citycount = {} # initialize an empty dictionary
import csv # we use the csv module, as we want to read a csv file
with open('../data/Dalziel2016_data.csv', 'r') as f: # 'r' stands for reading
    my_csv = csv.DictReader(f)
    for line in my_csv:
        # this is the city to update
        mycity = line['loc']
        # if it's present, keep the current value
        # if it's not present, initialize to 0
        citycount[mycity] = citycount.get(mycity, 0)
        # in either case, count the current record
        citycount[mycity] = citycount[mycity] + 1
That's it. Let's print the counts for a few cities:
In [8]:
for city in ['CHICAGO', 'LOS ANGELES', 'NEW YORK']:
print(city, citycount[city])
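As an aside, the same kind of tally could also be sketched with Counter from the standard library's collections module, which treats missing keys as having a count of zero. This is an alternative to the get approach used above, not a replacement for it; the list locs below is made up and stands in for the values of the 'loc' column:

```python
from collections import Counter

# made-up sequence of city names standing in for the 'loc' column
locs = ['CHICAGO', 'NEW YORK', 'CHICAGO', 'LOS ANGELES', 'CHICAGO']

demo_count = Counter()
for loc in locs:
    demo_count[loc] += 1  # a missing key is treated as 0, so no get is needed

print(demo_count['CHICAGO'])  # 3
```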
We can proceed as before. Remember that the mean is the sum of elements divided by the number of elements ($\mathbb E[x_1, x_2, x_3, x_4, \ldots, x_n] = \frac{1}{n}\sum_{i=1}^{n} x_i$).
Therefore, we can simply keep summing the population at each step, and at the end divide by the number of records. We create a new dictionary, citypop, whose values are lists containing the current sum of the population and the number of records for the city:
In [9]:
citypop = {}
import csv # we use the csv module, as we want to read a csv file
with open('../data/Dalziel2016_data.csv', 'r') as f: # 'r' stands for reading
    my_csv = csv.DictReader(f)
    for line in my_csv:
        # this is the city to update
        mycity = line['loc']
        # current pop
        pop = float(line['pop']) # transform to float
        # if it's present, keep the current list
        # if it's not present, initialize a list with both population and count as zero
        citypop[mycity] = citypop.get(mycity, [0,0])
        # update population (stored as first value of list)
        citypop[mycity][0] = citypop[mycity][0] + pop
        # update number of records (stored as second value of list)
        citypop[mycity][1] = citypop[mycity][1] + 1
In [10]:
citypop
Out[10]:
Excellent. Now each key in the dictionary indexes a list whose first element is the sum of all the population values, and the second element is the number of records that contributed to the sum. To obtain the average population, we divide the first by the second:
In [11]:
for city in citypop.keys():
    citypop[city][0] = citypop[city][0] / citypop[city][1]
Let's see some of the averages to make sure they make sense:
In [12]:
for city in ['CHICAGO', 'LOS ANGELES', 'NEW YORK']:
    print(city, citypop[city][0])
If we want to print only a few decimals, we can use round:
In [13]:
for city in ['CHICAGO', 'LOS ANGELES', 'NEW YORK']:
    print(city, round(citypop[city][0],1))
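To check the sum-and-count logic on numbers small enough to verify by hand, here is a minimal sketch on made-up records (the city labels 'A' and 'B' and their populations are invented). It uses the dictionary method setdefault instead of get, and stores the averages in a separate dictionary, so that the division can be repeated without altering the running totals:

```python
from statistics import mean

# made-up (city, population) records, for illustration only
records = [('A', 10.0), ('B', 4.0), ('A', 20.0), ('A', 30.0), ('B', 6.0)]

totals = {}  # city -> [sum of populations, number of records]
for city, pop in records:
    # setdefault returns the stored list, inserting [0.0, 0] first if the key is new
    entry = totals.setdefault(city, [0.0, 0])
    entry[0] += pop  # update the running sum
    entry[1] += 1    # update the number of records

averages = {city: s / n for city, (s, n) in totals.items()}
print(averages)

# cross-check one city against a direct computation of the mean
print(mean(p for c, p in records if c == 'A'))
```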
Though this exercise looks very much like the previous one, we need to change the data structure slightly. In fact, now each city contains many years, and each year should index the corresponding population. The following solution uses a dictionary (where the keys are the cities) of dictionaries (where the keys are the years) of lists (accumulated population, number of records per year)!
In [14]:
cityyear = {}
In [15]:
cityyear = {}
import csv # we use the csv module, as we want to read a csv file
with open('../data/Dalziel2016_data.csv', 'r') as f: # 'r' stands for reading
    my_csv = csv.DictReader(f)
    for line in my_csv:
        # this is the city to update
        mycity = line['loc']
        # this is the year to update
        year = line['year']
        # current pop
        pop = float(line['pop']) # transform to float
        # make sure the city is in the dictionary, or initialize
        cityyear[mycity] = cityyear.get(mycity, {})
        # make sure the year is in the sub-dictionary, or initialize
        cityyear[mycity][year] = cityyear[mycity].get(year, [0,0])
        # now proceed as for exercise 3 but access the inner dictionary
        # update population
        cityyear[mycity][year][0] = cityyear[mycity][year][0] + pop
        # update number of records
        cityyear[mycity][year][1] = cityyear[mycity][year][1] + 1
# now compute averages
for city in cityyear.keys():
    for year in cityyear[city].keys():
        cityyear[city][year][0] = cityyear[city][year][0] / cityyear[city][year][1]
Let's look at the results for Chicago: you can see that the population grew by more than 50% in the period covered by the data!
In [16]:
# the keys of a dictionary are not sorted, but here we want to order by year
# store the years in a list
years = list(cityyear['CHICAGO'].keys())
# sort the years (they are strings, but four-digit years sort correctly)
years.sort() # this is done in place!
# now print population for each year
for year in years:
    print(year, round(cityyear['CHICAGO'][year][0]))
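The two "make sure the key is present, or initialize" steps can also be delegated to defaultdict from the collections module, which creates a missing entry on first access. The sketch below is one possible variant under that assumption, run on invented records (the labels 'A' and 'B', the years, and the populations are all made up); it is not the approach used in the solution above:

```python
from collections import defaultdict

# made-up (city, year, population) records, for illustration only
records = [('A', '1906', 10.0), ('A', '1906', 20.0),
           ('A', '1907', 40.0), ('B', '1906', 5.0)]

# city -> year -> [sum of populations, number of records]
# missing cities and years are created automatically on first access
nested = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
for city, year, pop in records:
    nested[city][year][0] += pop  # update population
    nested[city][year][1] += 1    # update number of records

# print the yearly averages for one city, ordered by year
for year in sorted(nested['A']):
    total, n = nested['A'][year]
    print(year, total / n)
```

One caveat of this design: merely looking up a city or year that is absent adds an empty entry to the dictionary, so the get-based version may be preferable when accidental keys would be a problem.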