T.N.Olsthoorn, Feb 26, 2017
See Excel workbook in this folder
For this we list all files in the folder and check each if it has '.xls' in it and doesn't start with '~', because that means that the file is actually active in the Excel program.
In [70]:
from pprint import pprint
import matplotlib.pyplot as plt
Import the os module (operating system). It has the method listdir(), which gives a list of the files within a directory.
We apply os.listdir() to get a list of all files and immediately filter it to get only the files that have .xls and don't start with '~'
In [71]:
import os
In [72]:
os.listdir('.')
Out[72]:
In [73]:
import os
xls_files = [f for f in os.listdir() if ".xls" in f and not f.startswith('~')]
print(xls_files)
Next we'll read the data from the Excel workbook
In [74]:
import openpyxl # excel file functionality
wb = openpyxl.load_workbook(xls_files[0])
In [75]:
type(wb)
Out[75]:
In [76]:
wb.get_sheet_names()
Out[76]:
In [77]:
# Show the sheetnames inside the Excel workbook wb
wb.get_sheet_names()
Out[77]:
We can now treat the workbook as a dictionary in which each key is the name of a worksheet
In [78]:
wsPop = wb['Population']
In [79]:
type(wsPop)
Out[79]:
wsPop is an object. It has a number of methods and properties, which can be accessed by typing the name followed by a dot followed by a tab. To get more information it can be followed by a question mark.
On the mac wsPop.rows is a tuple of tuples. On some other systems it is a generator. It generates the tuples one after the other in a loop like
In [80]:
for r in wsPop.rows:
pass # do something
but with a generator, one cannot say len(wsPop.rows) or index it like wsPop.rows[3].
However it's easy to first generate a list of tuples with a comprehension:
In [81]:
data = [r for r in wsPop.rows]
and then continue to work with data instead of wsPop.rows. Therefore, if you have trouble indexing wsPop.rows, first generate a list as shown above and use that wherever you see wsPop.rows below.
In [82]:
cel = wsPop.rows[0][3] # may not work if wsPop.rows is a generator on your system and not a tuple of tuple.
cel = data[0][3] # this is a list of tuples, generated above.
In [83]:
cel.value
Out[83]:
We gen get the contents of sheet as a series of rows using the attribute rows,
and then show an arbitrary value
In [85]:
print( type(wsPop.rows) ) # shows that the wsPop generates a tuple of tuples
print( len(wsPop.rows) ) # shows the number of rows
print( wsPop.rows[3][1] ) # shows some cell
wsPop.rows[3][1].value # shows the value of some cell
Out[85]:
In [86]:
for r in wsPop.rows:
print(r[1].value)
Turn the rows propery into a list of rows, with each row the values of the cells in them.
We can do that in a list comprehension, in this case a list comprehension around a list comprehension.
For each row we turn it into a list of the values of each cell in that row. We do that for each row.
The result is a list of lists. This is done in one line:
In [87]:
data = [[c.value for c in r ] for r in wsPop.rows]
In [88]:
data[15][1]
Out[88]:
In [89]:
# Show the first 5 lines
for i in range(5):
print(data[i])
To index the data directly, we don't want a list of lists but a dictionary with the country name as key, such that for country the data is kept in a dictionary with keys obtained form the headers in the first two rows of the excel file as is shown above.
In [90]:
hdrs = data[0]
dims = data[1]
print()
print(hdrs)
print()
print(dims)
print()
Now glue together the hdrs and the dimes, and filter out the None texts:
All hdrs are strings, but that is not the case of the dims, where -2017 was turned in a value.
So when guening h and d together to a new string below, we have to use h + str(d).
The combined headers are obtained using a list comprehension, that also removes the text 'None' from the dims wherever it turns up:
In [91]:
hdrs = [(h + str(d)).replace('None','') for h, d in zip(hdrs, dims)]
pprint(hdrs)
In [92]:
h = data[13][:]
In [93]:
print(h)
name = h.pop(1)
print(name)
print(h)
We could generate a dict with the country names as keys where each cntry[key] as a list of itmes. We cannot use a dict comprehension here because we need to pop the country value from each row in to get the key for that rows. Hence a for loop is used:
In [94]:
cntry = dict() # empty dict
for i in range(2, len(wsPop.rows)):
row = [c.value for c in wsPop.rows[i]]
cname = row.pop(1)
cntry[cname] = row # entter the key cname and the value row into the dict cntry
Now we can get the data of any cntry like so:
In [95]:
cntry['Italy']
Out[95]:
However, this is not smart enough. We can't see what the figure in the list mean. Therefore, we should use a dictionary for the contents of each cntry with the fields as keys.
These fields are now simply obtained by popping the second item from the hdrs list:
In [96]:
print(hdrs)
hdrs.pop(1)
print()
print(hdrs)
Now generate the dict again, with the contens of each cntry itself a dict of its fields:
In [97]:
cntry = dict()
for i in range(2, len(wsPop.rows)): # skip the first two lines, which are headers
row = [c.value for c in wsPop.rows[i]] # turn the Excel row in a list
cname = row.pop(1) # pop of the country name, which becomes the key
cntry[cname] = {fld: v for fld, v in zip(hdrs, row)} # use dict comprehension
Now the contents of an arbitrary country looks like this, it's a dict with fields and values.
In [98]:
cntry['Netherlands']['Population-2017']
Out[98]:
Let's now compute the total population of the world. We do this by summing for each country.
In [99]:
totPop2017 = 0. # start out with zero
for c in cntry.keys():
totPop2017 += cntry[c]['Population-2017'] # get the value directly from indexing the field
print('Total population in the world is {:.2f} billion'.format(totPop2017/1e9))
In [100]:
# Now compute the toal populatin in 30 years, using the current growth rates
popWorld = [0 for i in range(1, 31)]
for k in cntry.keys():
pc = cntry[k]['Population-2017'] # country population 2017
try:
frc = cntry[k]['Fert.Rate'] / 100. # growth rate fraction
except:
# needed in case frc contains `None`
frc = 0 # simply use 0 in those cases
# for the country
popCntry = [pc * (1 + frc)**i for i in range(1, 31) ]
# for the entire world
popWorld = [pc + pw for pc, pw in zip(popCntry, popWorld)]
for i, pw in enumerate(popWorld):
yr = 2017 + i + 1
print('popWorld [{}] = {:5.2f} billion'.format(yr, pw/1e9))
In [101]:
wscont = wb['CountriesByContinent'] # read it
print(wscont) # shows it's a worksheet object
print(wscont.rows[0]) # shows it's a tuple of two cell objects
wscont.rows[0][0].value, wscont.rows[0][1].value # turned into a tuple of tow strings
Out[101]:
We can immediately, in a single line, turn this worksheet into a dictionary with name cont that has the country as key and the continent as field (a string).
In [102]:
cont = { v.value : k.value for k, v in wb['CountriesByContinent'] }
pprint(cont) # notic the key : value pairs separated by commas
Show for a few countries in which continent they are:
In [103]:
for c in ['Bahamas', 'Spain', 'Morocco', 'Honduras', 'Cambodia']:
print('{:20} lies in {}'.format(c, cont[c]))
continent to the cntry dictWe like to add the field Continent to our cntry dict using the cont dict.
This would be easy if the country names in both dicts would be exactly the same.
Let's see if this is the case.
We can check this using set logic.
Convert the keys of he cntry dict to a set and do the same with those of the cont dict
First step: What countries are in th cont dict that are not in the cntry dict?
In [104]:
df_cont_cntry = set(cont.keys()) - set(cntry.keys())
pprint(df_cont_cntry)
print("\n{} countries are in dict `cont` that are not in dict `cntry`".format(len(df_cont_cntry)))
And likewise: which are in the cntry dict that are not in the cont dict?
In [105]:
df_cntry_cont = set(cntry.keys()) - set(cont.keys())
pprint(df_cntry_cont)
print("\n{} countries are in dict `cntry` that are not in dict `cont`".format(len(df_cntry_cont)))
The set of contries that are in cont but not in cntry obviously have differently spelled names.
Probably the easiest way is to take the names from cntry that are not in cont and look up the continent in which each of these contries lies, and use that to supplement the missing countries. We don't then have to worry about the misspelled names.
In [106]:
# notice that the columsns in this excel sheet are in columns 2 and 3 and not in 1 and 2 !!
# We construct the dict in one line, using a single dict comprehension
missing = {rw[2].value : rw[1].value for rw in wb['missing']} # don't need .rows
pprint(missing)
Now we can complete our cntry dict with a Continent attribute for every country in it by using the dict cont and the dict missing.
Let's just join them, like so:
In [107]:
for k in missing.keys():
cont[k] = missing[k]
And then add the continent to the cntry dict
In [108]:
for k in cntry.keys():
cntry[k]['Continent'] = cont[k]
Now we can print the country name with its continent next to it:
In [109]:
for k in cntry.keys():
print("{:30} lies in {}".format(k, cntry[k]['Continent']))
The first step is extract the continents from the dict using set logic, i.e. by set comprehension. The result is a set with the unique values from the Continent field.
In [110]:
continents = { cntry[k]['Continent'] for k in cntry.keys()} # set comprehension
print(continents)
It's then possible to compute the future population be selecting the countries for this continent to do the computation on:
In [111]:
continent='Europe'
# population and fertility rate for the countries of this continent:
popCont = [(cntry[k]['Population-2017'], cntry[k]['Fert.Rate']) for k in cntry.keys() if cntry[k]['Continent']==continent]
print("The number of countries in {} is {}".format(continent, len(popCont)))
The last but one step is to compute the population for the coming years in each continent. We make a dictionary pTot with the continent as key and which has the list of future population values as a list:
In [112]:
pTot = dict() # empty dict
Nyr = 50 # year to predict
print("The predicted population inover the next {} years is:".format(Nyr))
for c in continents:
pTot[c] = [0 for i in range(Nyr)] # start with empty total for each continent
# generate a list of [Pop, fertility rate] for each country in this continent
popCont = [(cntry[k]['Population-2017'], cntry[k]['Fert.Rate'])
for k in cntry.keys() if cntry[k]['Continent'] == c]
# compute the country's future population and add to continentn total
for p, fr in popCont:
try:
p = float(p)
fr = float(fr)
# population of country in coming years
pcntry = [p * (1 + fr/100.)**i for i in range(1, Nyr + 1)]
# add to continent total
pTot[c]= [pt + p for pt, p in zip(pTot[c], pcntry)]
except:
# it crashes when fertility rate is 'None', we ignore these frew contries
pass
print("{:10s}".format(c), end="") # print continent
for p in pTot[c]:
print("{:6.2f}".format(p / 1.0e9), end="") # print Pop values
print()
A better overview of the results would be to have columns, one per continent, and the numbers vertical, with the year as first column. Although the numbers are now not naturally ordered to do this, it can be realized with little effort as follows:
In [113]:
print("The predicted population in billions:")
continents = pTot.keys()
styear = 2018 # starting year
print("{:4s}".format("Year"), end="")
for c in continents:
print("{:>12}".format(c), end="")
print()
for i in range(Nyr):
print("{:4d}".format(styear + i), end="")
for c in continents:
print("{:12.2f}".format(pTot[c][i]/1e9), end="")
print()
Finally, make a plot of the growth curves:
In [114]:
years = [2018 + i for i in range(Nyr)]
for c in continents:
plt.plot(years, pTot[c], label=c)
plt.xlabel('year')
plt.ylabel('Population [billions]')
plt.title('Population development')
plt.legend(loc='best', fontsize='x-small')
plt.show()
In [116]:
#plt.pie?
We can also put the year results in a dict that has as key the year and as values a dict with the continents.
In [117]:
contPop = dict()
for yr in range(Nyr):
contPop[2018 + yr] = { c : v for c, v in zip(continents, [pTot[c][yr] for c in continents])}
# show it
pprint(contPop[2025])
Then finally we could make pie charts for the population at say 4 points in time.
In [118]:
import numpy as np
fig, axs = plt.subplots(2,2, sharex=True, sharey=True)
axs = axs.ravel()
for ax, yr in zip(axs.ravel(), [2020, 2030, 2040,2050]):
ax.set_title(str(yr))
x = np.array(list(contPop[yr].values()))/1.0e9
y = list(contPop[yr].keys())
#print(x)
#print(y)
r = np.sqrt(np.sum(x))/4
ax.xlim = []
ax.pie(x, labels=y, radius=r)
plt.show()
In [ ]: