ADS-DV

Plotting scatter plots and histograms

Summary

This assignment first shows you how to download CSV data from an online source. We then explore a dataset of cities from all over the world and compare cities in the Netherlands with those in the rest of the world.

Loading CSV data with Pandas

While reproducible research recommends keeping your data in a place where you know it will not change, it may not be feasible to put large data files in your portfolio. We will work with a database of information about cities around the world:

https://www.maxmind.com/en/free-world-cities-database
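One way to check that a downloaded file has not changed between runs is to record a checksum once and compare it later. The sketch below uses only the standard library; the helper name `sha256_of_file` and the small demo file are illustrative assumptions, not part of the assignment.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a small temporary file (a stand-in for the downloaded dataset).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'Country,City\nnl,amsterdam\n')
    demo_path = tmp.name

checksum = sha256_of_file(demo_path)
print(checksum)
os.remove(demo_path)
```

Storing the printed digest next to your analysis lets you detect later whether the source file was silently updated.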

There are many ways to work with data structures in Python. The language itself offers lists, tuples and arrays. The arrays in the numpy package let you do heavy math operations efficiently. For data analysis Pandas is often used, because it stores data in so-called dataframes. Dataframes hold data with named columns and rows, and can easily be manipulated and plotted. You will learn more about Pandas in the Machine Learning workshops. A short intro can be found here:

http://pandas.pydata.org/pandas-docs/stable/10min.html
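As a small illustration of what a dataframe looks like, here is a minimal sketch with made-up values. The column names mirror the cities dataset, but the rows are invented purely for the example.

```python
import pandas as pd

# Toy data: made-up values, only to illustrate the DataFrame structure.
toy = pd.DataFrame({
    'City': ['alpha', 'beta', 'gamma'],
    'Population': [1000.0, None, 250.0],
})

# Select a column by name and filter rows with a boolean mask.
large = toy[toy['Population'] > 500]
print(large)

# Derived columns are added by assignment.
toy['PopulationThousands'] = toy['Population'] / 1000
print(toy)
```

Note that the missing population is stored as NaN and is simply excluded by the comparison.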


In [ ]:
import os
import urllib.request as urllib  # aliased so that urllib.urlopen(...) works below

url = 'http://download.maxmind.com/download/worldcities/'
filename = 'worldcitiespop.txt.gz'
datafolder = 'data/'

In [ ]:
downloaded = urllib.urlopen(url + filename)
buf = downloaded.read()

try:
    os.mkdir(datafolder)
except FileExistsError:
    pass

with open(datafolder + filename, 'wb') as f:
    f.write(buf)

In [ ]:
import pandas as pd
cities = pd.read_csv(datafolder + filename, sep=',', low_memory=False, encoding = 'ISO-8859-1')

Data manipulation

We can take a peek at the data by looking at its final rows. Do you see any potential problems with this dataset?


In [89]:
cities.tail()
# Problem: NaN ("not a number") values, e.g. in the Population column


Out[89]:
         Country         City   AccentCity  Region  Population    Latitude  Longitude
3173953       zw   zimre park   Zimre Park      04         NaN  -17.866111  31.213611
3173954       zw  ziyakamanas  Ziyakamanas      00         NaN  -18.216667  27.950000
3173955       zw   zizalisari   Zizalisari      04         NaN  -17.758889  31.010556
3173956       zw      zuzumba      Zuzumba      06         NaN  -20.033333  27.933333
3173957       zw   zvishavane   Zvishavane      07     79876.0  -20.333333  30.033333
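To quantify how much of a column is missing, one option is to count the NaN entries per column. The sketch below does this on a three-row stand-in table whose values mimic the rows shown above; on the real data the same calls would be applied to `cities`.

```python
import numpy as np
import pandas as pd

# Small stand-in for the cities table; the values mimic the rows shown above.
sample = pd.DataFrame({
    'City': ['zimre park', 'zuzumba', 'zvishavane'],
    'Population': [np.nan, np.nan, 79876.0],
})

# Number of missing values per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Fraction of rows without a population figure.
missing_fraction = sample['Population'].isna().mean()
print(missing_fraction)
```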

In [ ]:
cities.sort_values(by='Population', ascending=False).head()

Sorting the cities by population immediately shows the entries for a few of the largest cities in the world.

Assignment A

To get an idea of where in the world the cities in the dataset are located, we want to make a scatter plot of the position of all the cities in the dataset.

Don't worry about drawing country borders, just plot the locations of the cities.

Remember to use all the basic plot elements you need to understand this plot.


In [42]:
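import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

# Longitude on the x-axis, latitude on the y-axis
x = cities.Longitude
y = cities.Latitude

# Basic plot elements: axis labels and a title
plt.xlabel('Longitude (degrees)')
plt.ylabel('Latitude (degrees)')
plt.title('Locations of all cities in the dataset')
# s=1 keeps the markers small; one explicit color for all points
plt.scatter(x, y, s=1, color='black')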


Out[42]:
<matplotlib.collections.PathCollection at 0x23628575c18>

Assignment B

Now we want to plot the cities in The Netherlands only. Use a scatter plot again to plot the cities, but now vary the size of the marker and the color with the population of that city.

Use a colorbar to show how the color of the marker relates to its population.

Use sensible limits for your axes so that you show only the mainland Netherlands (and not the Dutch Antilles).


In [82]:
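dutch_cities = cities[cities['Country'] == 'nl']
plt.figure(figsize=[7, 7])

y = dutch_cities.Latitude
x = dutch_cities.Longitude
pop = dutch_cities.Population
popsize = pop / 450  # marker area scales with population

# Limit the axes to the mainland Netherlands
plt.xlim(3, 8)
plt.ylim(50.70, 53.6)
plt.xlabel('Longitude (degrees)')
plt.ylabel('Latitude (degrees)')
plt.title('Dutch cities by population')

# Pass the colormap by name; plt.cm.get_cmap is no longer available in recent Matplotlib
sc = plt.scatter(x, y, s=popsize, c=pop, cmap='YlOrRd', vmin=pop.min(), vmax=pop.max())

colorbar = plt.colorbar(sc)
colorbar.set_label('Population')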
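The `s` argument of `plt.scatter` is a marker area in points squared, so population values usually need rescaling before they can serve as sizes. Below is a minimal sketch of one possible scaling, assuming a linear map onto a chosen size range. The helper `population_to_sizes`, the size limits and the toy values are all assumptions for illustration.

```python
import numpy as np
import pandas as pd


def population_to_sizes(pop, min_size=5.0, max_size=300.0):
    """Map populations linearly onto [min_size, max_size]; NaN gets min_size."""
    pop = pd.Series(pop, dtype=float)
    lo, hi = pop.min(), pop.max()  # min/max skip NaN by default
    if hi == lo:
        scaled = pd.Series(0.0, index=pop.index)
    else:
        scaled = (pop - lo) / (hi - lo)
    sizes = min_size + scaled * (max_size - min_size)
    return sizes.fillna(min_size)


toy_pop = [1000.0, np.nan, 500000.0, 250500.0]
sizes = population_to_sizes(toy_pop)
print(sizes.tolist())
```

Compared with dividing by a fixed constant, a bounded range keeps the smallest cities visible and prevents the largest one from covering its neighbours.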


Assignment C

In assignment B we could clearly see larger cities such as Amsterdam, Rotterdam and even Eindhoven. But we still do not have a clear overview of how many big cities there are. To show a distribution we use a histogram.

What happens if we do not call the .dropna() function?

Add proper basic plot elements to this plot and try to annotate which data points are Amsterdam and Eindhoven.
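To explore why missing values matter for a histogram, the sketch below bins a small made-up series with NumPy, once with NaN left in and once after dropping them. The toy values are assumptions. How a particular plotting call reacts to NaN may depend on the installed NumPy and Matplotlib versions, so the first attempt is wrapped in a try/except rather than relied upon.

```python
import numpy as np
import pandas as pd

toy_pop = pd.Series([10.0, 20.0, np.nan, 30.0, np.nan, 40.0])

# With NaN present the automatic bin range cannot be determined reliably.
try:
    counts_raw, _ = np.histogram(toy_pop.to_numpy(), bins=3)
    print('with NaN:', counts_raw)
except ValueError as err:
    print('with NaN:', err)

# After dropna() every remaining value lands in a bin.
clean = toy_pop.dropna()
counts, edges = np.histogram(clean.to_numpy(), bins=3)
print('without NaN:', counts, edges)
```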


In [229]:
# Population (in thousands) of the two cities to annotate; takes the first matching row
is_eind = dutch_cities.City == 'eindhoven'
is_adam = dutch_cities.City == 'amsterdam'
PopEind = dutch_cities.loc[is_eind, 'Population'].iloc[0] / 1000
PopAdam = dutch_cities.loc[is_adam, 'Population'].iloc[0] / 1000

plt.figure()
# density=True replaces the normed argument, which newer Matplotlib no longer accepts
bars = plt.hist(np.asarray(dutch_cities.dropna().Population / 1000), 100, density=True)

plt.annotate('Eindhoven', xy=(PopEind, 0), xytext=(PopEind, 0.005),
             arrowprops=dict(facecolor='red', shrink=0.01))
plt.annotate('Amsterdam', xy=(PopAdam, 0), xytext=(PopAdam, 0.005),
             arrowprops=dict(facecolor='grey', shrink=0.05))
plt.title('Population distribution of Dutch cities')
plt.xlabel('Number of inhabitants (thousands)')
plt.ylabel('Proportion of cities with this many inhabitants')


Out[229]:
<matplotlib.text.Text at 0x2361062bc18>

Assignment D

Now we want to see how the population distribution of Dutch cities compares to that of the entire world.

Use subplots to show the Dutch distribution (top plot) and the world distribution (bottom plot).
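A sketch of a two-panel layout using `plt.subplots`, with identical bins and a shared x-axis so the panels can be compared directly. The data is randomly generated stand-in data (log-normal draws), not the real populations, and the non-interactive `Agg` backend is selected only so the sketch also runs as a plain script.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: script use; not needed in a notebook
import numpy as np
from matplotlib import pyplot as plt

rng = np.random.default_rng(0)
# Stand-in samples, in thousands of inhabitants (synthetic, not real data).
sample_top = rng.lognormal(mean=2.0, sigma=1.0, size=500)
sample_bottom = rng.lognormal(mean=1.5, sigma=1.2, size=5000)

bins = np.arange(0, 200, 1)  # identical bins for both panels
fig, (ax_top, ax_bottom) = plt.subplots(2, 1, sharex=True, figsize=(10, 6))

ax_top.hist(sample_top, bins=bins, density=True)
ax_top.set_title('Sample A (top)')
ax_top.set_ylabel('Density')

ax_bottom.hist(sample_bottom, bins=bins, density=True)
ax_bottom.set_title('Sample B (bottom)')
ax_bottom.set_xlabel('Population (thousands)')
ax_bottom.set_ylabel('Density')

fig.tight_layout()
```

Sharing the bins and the x-axis means any difference between the panels comes from the data rather than from the binning.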


In [243]:
plt.figure(figsize=[20, 8])
bins = np.arange(0, 200, 1)  # same bins (1 thousand wide) for both plots

# Top: Dutch cities
plt.subplot(2, 1, 1)
plt.title("Dutch City Distribution")
# density=True replaces the normed argument, which newer Matplotlib no longer accepts
plt.hist(np.asarray(dutch_cities.dropna().Population / 1000), bins=bins, density=True)
plt.ylim(0.00, 0.10)

# Bottom: all cities in the world
plt.subplot(2, 1, 2)
plt.title("Global City Distribution")
plt.xlabel('Number of inhabitants (thousands)')
plt.hist(np.asarray(cities.dropna().Population / 1000), bins=bins, density=True);


Assignment E

What conclusions can you draw from the plots above?


In [ ]:
# The Netherlands seems to have relatively more large cities than the world as a whole, although the single most populous city is found in the global dataset.