ADS-DV

Plotting scatter plots and histograms

Summary

This assignment first shows you how to download CSV data from an online source. We then explore a dataset of cities from all over the world and compare cities in the Netherlands to the rest of the world.

Loading data: CSV and Pandas

While reproducible research recommends having your data somewhere where you know it will not change, it may not be feasible to put large datafiles in your portfolio. We will work with a database of information about cities around the world:

https://www.maxmind.com/en/free-world-cities-database
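Since the data file stays outside the portfolio, one hedged way to notice later that it has changed is to record a checksum right after downloading. The sketch below is only an illustration: the helper name `sha256_of_file` and the demo file are assumptions, not part of the assignment.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps reading until f.read() returns b''
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a small throwaway file (made-up content, not the real dataset)
payload = b'Country,City\nnl,eindhoven\n'
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / 'demo.csv'
    demo.write_bytes(payload)
    checksum = sha256_of_file(demo)

print(checksum)
```

Storing the printed digest next to your analysis lets you compare it against a fresh download; a different digest means the source file is no longer the one you analysed.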

There are many ways to work with data structures in Python. There are the standard Python arrays, lists and tuples. You can also use the arrays in the numpy package, which allow you to do heavy math operations efficiently. For data analysis, Pandas is often used because data can be stored in so-called dataframes. Dataframes store data with column and row names and can easily be manipulated and plotted. You will learn more about Pandas in the Machine Learning workshops. A short intro can be found here:

http://pandas.pydata.org/pandas-docs/stable/10min.html
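To make the difference between these data structures concrete, here is a minimal sketch that performs the same conversion (population to thousands) with a plain list, a numpy array and a Pandas dataframe. The city names and numbers are made up for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up illustration values, not taken from the dataset
names = ['alpha', 'beta', 'gamma']
population = [1000.0, 2500.0, 400.0]

# Plain list: arithmetic needs an explicit loop or comprehension
in_thousands_list = [p / 1000 for p in population]

# NumPy array: arithmetic is applied element-wise in one expression
in_thousands_array = np.array(population) / 1000

# DataFrame: named columns plus a row index, with the same element-wise math
df = pd.DataFrame({'City': names, 'Population': population})
df['PopulationK'] = df['Population'] / 1000

print(df)
```

The dataframe keeps the city names and the derived column together, which is what makes it convenient for the filtering and plotting steps later in this assignment.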



In [ ]:

    
import urllib.request as urllib, zipfile, os

url = 'http://download.maxmind.com/download/worldcities/'
filename = 'worldcitiespop.txt.gz'
datafolder = 'data/'



In [ ]:

    
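# Download the compressed CSV file into memory
downloaded = urllib.urlopen(url + filename)
buf = downloaded.read()

# Create the data folder; no error if it already exists
os.makedirs(datafolder, exist_ok=True)

# Write the raw bytes to disk
with open(datafolder + filename, 'wb') as f:
    f.write(buf)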



In [ ]:

    
import pandas as pd
cities = pd.read_csv(datafolder + filename, sep=',', low_memory=False, encoding = 'ISO-8859-1')

Data manipulation

We can take a peek at the data by checking out the final rows of data. Do you see any potential problem with this dataset?
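Besides looking at a few rows, one way to check a dataframe for gaps is to count missing values per column. The sketch below uses a small made-up frame that only mimics two of the dataset's columns; the same calls could be applied to `cities`.

```python
import numpy as np
import pandas as pd

# Small made-up frame mimicking two of the dataset's columns
sample = pd.DataFrame({
    'City': ['a', 'b', 'c', 'd'],
    'Population': [np.nan, 1200.0, np.nan, 560.0],
})

# Number of missing (NaN) values in each column
missing_per_column = sample.isna().sum()

# Share of rows that have no population value
fraction_missing = sample['Population'].isna().mean()

print(missing_per_column)
print(fraction_missing)
```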



In [89]:

    
cities.tail()
# Potential problem: Population is NaN (Not a Number, i.e. missing) for most of these rows









    Out[89]:

             Country  City         AccentCity   Region  Population  Latitude    Longitude
    3173953  zw       zimre park   Zimre Park   04      NaN         -17.866111  31.213611
    3173954  zw       ziyakamanas  Ziyakamanas  00      NaN         -18.216667  27.950000
    3173955  zw       zizalisari   Zizalisari   04      NaN         -17.758889  31.010556
    3173956  zw       zuzumba      Zuzumba      06      NaN         -20.033333  27.933333
    3173957  zw       zvishavane   Zvishavane   07      79876.0     -20.333333  30.033333



In [ ]:

    
cities.sort_values(by='Population', ascending=False).head()

By sorting the cities by population, we immediately see the entries for a few of the largest cities in the world.
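The sorting step can be tried on a tiny made-up frame to see where rows without a population end up. As far as I know, `sort_values` places NaN values last by default (`na_position='last'`), which is why the missing populations do not get in the way of the largest cities.

```python
import numpy as np
import pandas as pd

# Made-up values; 'b' has no known population
sample = pd.DataFrame({
    'City': ['a', 'b', 'c'],
    'Population': [500.0, np.nan, 9000.0],
})

# Largest population first; the NaN row is placed at the end
largest_first = sample.sort_values(by='Population', ascending=False)

print(largest_first)
```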

Assignment A

To get an idea of where in the world the cities in the dataset are located, we want to make a scatter plot of the position of all the cities in the dataset.

Don't worry about drawing country borders, just plot the locations of the cities.

Remember to use all the basic plot elements you need to understand this plot.



In [42]:

    
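import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

# Basic plot elements: figure size, axis labels and a title
plt.figure(figsize=[12, 6])
plt.xlabel('Longitude (degrees)')
plt.ylabel('Latitude (degrees)')
plt.title('Locations of all cities in the dataset')

# Small black markers keep the large number of points readable
plt.scatter(cities.Longitude, cities.Latitude, s=1, color='black')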









    Out[42]:





<matplotlib.collections.PathCollection at 0x23628575c18>

Assignment B

Now we want to plot only the cities in the Netherlands. Use a scatter plot again, but now vary the size and the color of each marker with the population of that city.

Use a colorbar to show how the color of the marker relates to its population.

Use sensible axis limits so that you show only the mainland Netherlands (and not the Dutch Antilles).



In [82]:

    
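dutch_cities = cities[cities['Country'] == 'nl']
# Only cities with a known population can be given a marker size and color
known = dutch_cities.dropna(subset=['Population'])
plt.figure(figsize=[7,7]);

pop = known.Population
popsize = pop / 450  # marker area scales with population

# Axis limits restrict the view to the mainland Netherlands
plt.xlim(3, 8)
plt.ylim(50.70, 53.6)
plt.xlabel('Longitude (degrees)')
plt.ylabel('Latitude (degrees)')
plt.title('Dutch cities by population')

sc = plt.scatter(known.Longitude, known.Latitude, s=popsize, c=pop, cmap='YlOrRd',
                 vmin=pop.min(), vmax=pop.max())

colorbar = plt.colorbar(sc, label='Population')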

Assignment C

In assignment B we could clearly see larger cities such as Amsterdam, Rotterdam and even Eindhoven. But we still do not have a clear overview of how many big cities there are. To show the distribution of city sizes, we use a histogram.

What happens if we do not call the .dropna() function?

Add proper basic plot elements to this plot and try to annotate which data points are Amsterdam and Eindhoven.
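The question about .dropna() can be explored on a few made-up numbers before touching the real data. This sketch calls `np.histogram` directly, which matplotlib's `hist` uses for the binning. In recent NumPy versions, NaN values typically prevent the bin range from being determined and an error is raised, but the exact behavior depends on your installed versions, so the sketch only reports what happens.

```python
import numpy as np
import pandas as pd

# Made-up populations in thousands, with two missing values
population = pd.Series([120.0, np.nan, 45.0, np.nan, 800.0])

# Without dropna(): report what happens instead of assuming it
try:
    np.histogram(population.to_numpy(), bins=5)
    outcome = 'no error'
except ValueError as err:
    outcome = f'ValueError: {err}'
print(outcome)

# With dropna(): only the three known values are binned
clean = population.dropna()
counts, edges = np.histogram(clean.to_numpy(), bins=5)
print(counts.sum())
```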



In [229]:

    
def population_k(name):
    # Population (in thousands) of the first Dutch city with this name
    return dutch_cities.loc[dutch_cities.City == name, 'Population'].iloc[0] / 1000

PopEind = population_k('eindhoven')
PopAdam = population_k('amsterdam')

plt.figure();
# density=True replaces the normed argument, which newer matplotlib versions no longer accept
bars = plt.hist(dutch_cities.Population.dropna() / 1000, 100, density=True);

# Arrows point at the positions of the two cities on the population axis
plt.annotate('Eindhoven', xy=(PopEind, 0), xytext=(PopEind, 0.005),
            arrowprops=dict(facecolor='red', shrink=0.01),
            )
plt.annotate('Amsterdam', xy=(PopAdam, 0), xytext=(PopAdam, 0.005),
            arrowprops=dict(facecolor='grey', shrink=0.05),
            )
plt.title('Population distribution of Dutch cities')
plt.xlabel('Number of inhabitants (thousands)')
plt.ylabel('Proportion of cities with this many inhabitants')









    Out[229]:





<matplotlib.text.Text at 0x2361062bc18>

Assignment D

Now we want to see how the population distribution of Dutch cities compares to that of the entire world.

Use subplots to show the Dutch distribution (top plot) and the world distribution (bottom plot).
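Because the two groups contain very different numbers of cities, comparing raw counts would be misleading. A normalized histogram over shared bins makes the shapes comparable. The sketch below, with made-up numbers, checks that a density histogram integrates to 1 regardless of group size.

```python
import numpy as np

# Made-up populations in thousands for a small and a large group
small_group = np.array([5, 12, 18, 40, 75])
large_group = np.array([3, 7, 9, 14, 22, 31, 48, 66, 90, 150, 180, 199])

# Shared bin edges so both histograms cover the same intervals
bins = np.arange(0, 201, 50)  # 0, 50, 100, 150, 200

small_density, _ = np.histogram(small_group, bins=bins, density=True)
large_density, _ = np.histogram(large_group, bins=bins, density=True)

# Density times bin width sums to 1 for each group
widths = np.diff(bins)
small_area = (small_density * widths).sum()
large_area = (large_density * widths).sum()
print(small_area, large_area)
```

Note that values outside the bin range are ignored, so the choice of bin edges also decides which cities are part of the comparison.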



In [243]:

    
plt.figure(figsize=[20, 8]);
bins = np.arange(0, 200, 1)  # shared bins of 1,000 inhabitants each

# Top plot: Dutch cities (density=True replaces the removed normed argument)
plt.subplot(2,1,1);
plt.title("Dutch City Distribution")
plt.hist(dutch_cities.Population.dropna()/1000, bins=bins, density=True);
plt.ylim(0.00, 0.10)

# Bottom plot: all cities in the dataset
plt.subplot(2,1,2);
plt.title("Global City Distribution")
plt.xlabel('Number of inhabitants (thousands)')
plt.hist(cities.Population.dropna()/1000, bins=bins, density=True);

Assignment E

Write down what conclusions you can draw from the plots above.



In [ ]:

    
# It seems that cities in the Netherlands are bigger in general, although the single most populous city is found in the global distribution.
