Reproducible research recommends keeping your data somewhere you know it will not change, but it may not be feasible to put large data files in your portfolio. We will work with a database of information about cities around the world:
https://www.maxmind.com/en/free-world-cities-database
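One way to check that a downloaded copy has not changed between runs is to record a checksum and compare against it later. The sketch below is an assumption about how that could look, not part of the assignment: `file_sha256` is a hypothetical helper, and the demonstration uses a small temporary file instead of the real dataset.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a throwaway file (a stand-in for the downloaded data file).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.txt')
    with open(demo_path, 'wb') as f:
        f.write(b'Country,City,Population\n')
    first = file_sha256(demo_path)
    second = file_sha256(demo_path)   # unchanged file, same digest
    with open(demo_path, 'ab') as f:
        f.write(b'xx,exampletown,12345\n')
    changed = file_sha256(demo_path)  # modified file, different digest

print(first == second, first == changed)
```

Storing the digest of the file you analysed alongside your results lets you detect later whether a fresh download still matches it.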
There are many ways to work with data structures in Python. The language itself offers lists and tuples, and the standard library adds arrays. The arrays in the numpy package let you do heavy math operations efficiently. For data analysis, pandas is often used because data can be put into so-called DataFrames, which store data with column and row names and are easy to manipulate and plot. You will learn more about pandas in the Machine Learning workshops. A short intro can be found here:
In [ ]:
import urllib.request as urllib, zipfile, os
url = 'http://download.maxmind.com/download/worldcities/'
filename = 'worldcitiespop.txt.gz'
datafolder = 'data/'
In [ ]:
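# Download the file into memory.
downloaded = urllib.urlopen(url + filename)
buf = downloaded.read()
# Create the data folder if it does not exist yet.
try:
    os.mkdir(datafolder)
except FileExistsError:
    pass
# Write the downloaded bytes to disk.
with open(datafolder + filename, 'wb') as f:
    f.write(buf)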
In [ ]:
import pandas as pd
cities = pd.read_csv(datafolder + filename, sep=',', low_memory=False, encoding = 'ISO-8859-1')
In [89]:
cities.tail()
# NaN stands for "Not a Number" and marks missing values.
Out[89]:
In [ ]:
cities.sort_values(by='Population', ascending=False).head()
Sorting the cities by population immediately shows the entries for a few of the largest cities in the world.
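The effect of missing populations on the sort can be seen on a small made-up table. This is only an illustrative sketch: the city names and numbers below are invented, and only the `Population` column name is taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data; names and numbers are made up for illustration.
toy = pd.DataFrame({
    'City': ['a', 'b', 'c', 'd'],
    'Population': [5000.0, np.nan, 120000.0, 800.0],
})

# Rows with a missing Population are placed last by default
# (na_position='last'), so head() starts with the largest known values.
largest = toy.sort_values(by='Population', ascending=False).head(2)

# nlargest returns the same top rows here.
same = toy.nlargest(2, 'Population')

print(largest.City.tolist())
```

Because missing values sort to the end, calling `head()` after a descending sort is not disturbed by cities without a recorded population.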
To get an idea of where in the world the cities in the dataset are located, we want to make a scatter plot of the position of all the cities in the dataset.
Don't worry about drawing country borders, just plot the locations of the cities.
Remember to use all the basic plot elements you need to understand this plot.
In [42]:
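import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
y = cities.Latitude
x = cities.Longitude
# One small black marker per city.
plt.scatter(x, y, s=1, color='black')
plt.xlabel('Longitude (degrees)')
plt.ylabel('Latitude (degrees)')
plt.title('Locations of the cities in the dataset')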
Out[42]:
Now we want to plot the cities in The Netherlands only. Use a scatter plot again to plot the cities, but now vary the size of the marker and the color with the population of that city.
Use a colorbar to show how the color of the marker relates to its population.
Use sensible limits to your axes so that you show only mainland The Netherlands (and not the Dutch Antilles).
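Selecting one country and restricting it to a region can be done with boolean masks. The sketch below uses a made-up table; the column names `Country`, `Latitude` and `Longitude` follow the dataset, while the rows, the approximate coordinates and the bounding box are assumptions chosen for illustration.

```python
import pandas as pd

# Toy rows; the coordinates are approximate and the last row is hypothetical.
toy = pd.DataFrame({
    'Country': ['nl', 'nl', 'be', 'nl'],
    'City': ['amsterdam', 'eindhoven', 'brussels', 'island_town'],
    'Latitude': [52.37, 51.44, 50.85, 12.15],
    'Longitude': [4.89, 5.48, 4.35, -68.27],
})

# One mask per condition; & combines them element by element.
is_dutch = toy['Country'] == 'nl'
in_box = toy['Latitude'].between(50.7, 53.6) & toy['Longitude'].between(3, 8)

mainland = toy[is_dutch & in_box]
print(mainland.City.tolist())
```

Filtering the rows is an alternative to setting axis limits: the limits only hide points outside the view, while a mask removes them from the data being plotted.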
In [82]:
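dutch_cities = cities[cities['Country'] == 'nl']
plt.figure(figsize=[7, 7])
# plt.get_cmap replaces plt.cm.get_cmap, which newer matplotlib versions no longer provide.
cm = plt.get_cmap('YlOrRd')
y = dutch_cities.Latitude
x = dutch_cities.Longitude
pop = dutch_cities.Population
# Scale the marker area with the population.
popsize = pop / 450
# Limit the axes to mainland Netherlands.
plt.xlim(3, 8)
plt.ylim(50.70, 53.6)
sc = plt.scatter(x, y, s=popsize, c=pop, cmap=cm, vmin=pop.min(), vmax=pop.max())
colorbar = plt.colorbar(sc)
colorbar.set_label('Population')
plt.xlabel('Longitude (degrees)')
plt.ylabel('Latitude (degrees)')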
Using assignment B, we could clearly see larger cities such as Amsterdam, Rotterdam and even Eindhoven. But we still do not really have a clear overview of how many big cities there are. To show a distribution we use a histogram plot.
What happens if we do not call the .dropna() function?
Add proper basic plot elements to this plot and try to annotate which data point is Amsterdam and Eindhoven.
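Before experimenting with the plot, it can help to see what `.dropna()` itself does on a small table. This is a sketch on made-up data; the `Region` column and all values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data with missing values in two different columns.
toy = pd.DataFrame({
    'City': ['a', 'b', 'c'],
    'Region': ['x', np.nan, 'y'],
    'Population': [1000.0, 2000.0, np.nan],
})

rows_all = toy.dropna()                       # drops rows with a NaN in any column
rows_pop = toy.dropna(subset=['Population'])  # only requires Population to be present
pop_only = toy['Population'].dropna()         # drops NaN from one column only

print(len(rows_all), len(rows_pop), len(pop_only))
```

Note that calling `.dropna()` on the whole DataFrame also discards rows whose population is known but which lack a value in another column.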
In [229]:
# Population (in thousands) of the two cities to annotate.
# .max() reduces each selection to a single number, also if a name occurs more than once.
PopEind = dutch_cities.loc[dutch_cities.City == 'eindhoven', 'Population'].max() / 1000
PopAdam = dutch_cities.loc[dutch_cities.City == 'amsterdam', 'Population'].max() / 1000
plt.figure()
# density=True replaces the removed normed argument.
bars = plt.hist(np.asarray(dutch_cities.dropna().Population / 1000), 100, density=True)
plt.annotate('Eindhoven', xy=(PopEind, 0), xytext=(PopEind, 0.005),
             arrowprops=dict(facecolor='red', shrink=0.01))
plt.annotate('Amsterdam', xy=(PopAdam, 0), xytext=(PopAdam, 0.005),
             arrowprops=dict(facecolor='grey', shrink=0.05))
plt.xlabel('Number of inhabitants (thousands)')
plt.ylabel('Proportion of cities with this many inhabitants')
Out[229]:
In [243]:
plt.figure(figsize=[20, 8])
plt.subplot(2, 1, 1)
plt.title("Dutch City Distribution")
# density=True replaces the removed normed argument.
plt.hist(np.asarray(dutch_cities.dropna().Population / 1000), bins=np.arange(0, 200, 1), density=True)
plt.ylim(0.00, 0.10)
# The subplot of the world cities goes below the Dutch one.
plt.subplot(2, 1, 2)
plt.title("Global City Distribution")
plt.hist(np.asarray(cities.dropna().Population / 1000), bins=np.arange(0, 200, 1), density=True)
In [ ]:
# Cities in the Netherlands appear to be larger in general, although the single most populous city is found in the global distribution.
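A visual impression like this can be backed up with a few summary numbers. The sketch below is one possible way to do that, with a hypothetical helper `summarize` and invented population values; it is not taken from the assignment.

```python
import numpy as np
import pandas as pd


def summarize(population):
    """Return median, 90th percentile and maximum of a population Series."""
    known = population.dropna()  # ignore cities without a recorded population
    return {
        'median': known.quantile(0.5),
        'p90': known.quantile(0.9),
        'max': known.max(),
    }


# Made-up populations standing in for two groups of cities.
group_a = pd.Series([1000.0, 5000.0, 20000.0, np.nan, 800000.0])
group_b = pd.Series([200.0, 900.0, 3000.0, 15000000.0, np.nan])

summary_a = summarize(group_a)
summary_b = summarize(group_b)
print(summary_a['median'], summary_b['median'])
```

Comparing medians or percentiles says something about typical city size, while the maximum only reflects the single largest city in each group.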