In this example, we will download and analyze data on a large number of cities around the world and their populations. This data has been created by MaxMind and is available for free at http://www.maxmind.com.
We first download the Zip file and uncompress it into a folder. The Zip file is about 40 MB, so downloading it may take a while.
In [1]:
import urllib2, zipfile
In [2]:
url = 'http://ipython.rossant.net/'
In [3]:
filename = 'cities.zip'
In [4]:
downloaded = urllib2.urlopen(url + filename)
In [5]:
folder = 'data'
In [6]:
mkdir data
In [7]:
with open(filename, 'wb') as f:
    f.write(downloaded.read())
In [8]:
with zipfile.ZipFile(filename) as zf:
    zf.extractall(folder)
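The cells above use urllib2, which only exists in Python 2. As a minimal sketch, assuming Python 3 (where urlopen lives in urllib.request), the same download-and-extract steps could be wrapped in one function. The function name download_and_extract and its parameters are illustrative and not part of the original notebook.

```python
import os
import shutil
import zipfile
from urllib.request import urlopen  # Python 3 counterpart of urllib2.urlopen


def download_and_extract(url, archive_path, folder):
    """Download `url` to `archive_path`, unzip it into `folder`.

    Returns the list of file names contained in the archive.
    Hypothetical helper, written for illustration only.
    """
    os.makedirs(folder, exist_ok=True)
    # Stream the response to disk in chunks instead of holding
    # the whole archive in memory at once.
    with urlopen(url) as response, open(archive_path, 'wb') as f:
        shutil.copyfileobj(response, f)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(folder)
        return zf.namelist()


# Usage with the names from the cells above (not executed here):
# download_and_extract(url + filename, filename, folder)
```

Streaming with shutil.copyfileobj is a design choice for large files; calling read() once, as in the cells above, is also fine for an archive of this size.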
Now, we're going to use Pandas to load the CSV file that has been extracted. The read_csv function of Pandas can open any CSV file.
In [9]:
import pandas as pd
In [10]:
filename = 'data/worldcitiespop.txt'
In [11]:
data = pd.read_csv(filename)
Now, let's explore the newly created data object.
In [12]:
type(data)
Out[12]:
The data object is a DataFrame, a Pandas type consisting of a two-dimensional labeled data structure with columns of potentially different types (like an Excel spreadsheet). As with a NumPy array, the shape attribute returns the shape of the table. But unlike a NumPy array, the DataFrame object has a richer structure, and in particular the keys method returns the names of the different columns.
In [13]:
data.shape, data.keys()
Out[13]:
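To make the structure of a DataFrame concrete without downloading the full file, here is a small sketch that reads a three-row CSV from memory. The rows and values are made up for illustration; only the idea (read_csv, shape, keys, per-column types) comes from the text above.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the values are invented for illustration.
csv_text = """Country,City,Population
fr,paris,2000000
us,new york,8000000
ad,aixas,
"""
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.keys()))   # column names, same as sample.columns
print(sample.dtypes)         # each column carries its own type
```

The Population column is read as floating point because one value is missing, and NaN is a floating-point value.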
We can see that data has more than 3 million rows and seven columns, including the country, city, population, and GPS coordinates of each city. The head and tail methods give a quick look at the beginning and the end of the table, respectively.
In [14]:
data.tail()
Out[14]:
We can see that these cities have NaN values as populations. The reason is that the population is not available for all cities in the data set, and Pandas handles those missing values transparently.
We'll see in the next sections what we can actually do with these data.
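As a brief sketch of how Pandas treats missing values, the following example uses an invented three-row table in which one population is absent. The data is hypothetical; isnull, count, and dropna are standard Pandas methods.

```python
import io
import pandas as pd

# Hypothetical table: the second city has no population value.
csv_text = "City,Population\nalpha,1000\nbeta,\ngamma,250\n"
sample = pd.read_csv(io.StringIO(csv_text))

missing = sample.Population.isnull()   # boolean Series, True where NaN
print(missing.sum())                   # number of rows without a population
print(sample.Population.count())       # count() ignores NaN
print(sample.dropna())                 # keeps only the complete rows
```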
Each column of the DataFrame object can be accessed by its name. In IPython, tab completion notably offers the column names as attributes of the object. Here we get the series with the names of all cities (AccentCity is the full name of the city, with uppercase characters and accents).
In [15]:
data.AccentCity
Out[15]:
This column is an instance of the Series class. We can access specific rows using indexing. In the following example, we get the name of the city at index 30000 (the 30,001st row, since indexing is 0-based):
In [16]:
data.AccentCity[30000]
Out[16]:
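The two ways of reaching a column, and the difference between indexing by label and by position, can be sketched on a tiny stand-in table. The three rows below are invented; the real data object is assumed to behave the same way.

```python
import pandas as pd

# Hypothetical miniature table standing in for `data`.
sample = pd.DataFrame({'AccentCity': ['Paris', 'New York', 'Tokyo'],
                       'Population': [2.0e6, 8.0e6, 9.0e6]})

col_attr = sample.AccentCity       # attribute access, as offered by tab completion
col_item = sample['AccentCity']    # bracket access, works for any column name
print(col_attr.equals(col_item))
print(type(col_attr).__name__)

print(sample.AccentCity[1])        # label 1 of the default 0-based index
print(sample.AccentCity.iloc[1])   # position 1, whatever the index labels are
```

With the default integer index, labels and positions coincide, which is why plain indexing works in the cell above.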
So we can access an element if we know its index. But how can we obtain a city from its name? For example, we'd like to obtain the population and GPS coordinates of New York. One possibility would be to loop through all cities and check their names, but this would be extremely slow because Python loops over millions of elements are not optimized at all. Pandas and NumPy offer a much more elegant and efficient way called boolean indexing. There are two steps that typically occur on the same line of code. First, we create an array of boolean values indicating, for each element, whether it satisfies a condition or not (here, whether the city name is New York). Then, we pass this array of booleans as an index to our original array: the result is a subset of the full array containing only the elements corresponding to True. For example:
In [17]:
data[data.AccentCity=='New York']
Out[17]:
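The two steps of boolean indexing can be written out separately. This is a minimal sketch on invented data, first with a NumPy array and then with a small DataFrame; the city and country values are hypothetical.

```python
import numpy as np
import pandas as pd

names = np.array(['Paris', 'New York', 'Tokyo', 'New York'])

# Step 1: one boolean per element, True where the condition holds.
mask = (names == 'New York')
print(mask)

# Step 2: index with the mask to keep only the True positions.
print(names[mask])
print(np.nonzero(mask)[0])   # integer positions of the matches

# The same pattern on a DataFrame selects whole rows.
sample = pd.DataFrame({'AccentCity': names,
                       'Country': ['fr', 'us', 'jp', 'gb']})
matches = sample[sample.AccentCity == 'New York']
print(matches)
```

Note that the selected rows keep their original index labels, which is how a row found this way can later be looked up by its index.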
The same syntax works in NumPy and Pandas. Here, we find a dozen cities named New York, but only one happens to be in the state of New York. To access a single row with Pandas by its index label, we can use the .loc attribute:
In [18]:
ny = 2990572
data.loc[ny]
Out[18]:
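Rather than reading the row label off the printed table, several conditions can be combined to find it programmatically. The sketch below assumes the table has Country and Region columns with values such as 'us' and 'NY'; those names and values are assumptions, and the three rows are invented.

```python
import pandas as pd

# Hypothetical rows; the 'Region' column and its codes are assumptions.
sample = pd.DataFrame({
    'AccentCity': ['New York', 'New York', 'Albany'],
    'Country': ['gb', 'us', 'us'],
    'Region': ['H7', 'NY', 'NY'],
})

is_named = sample.AccentCity == 'New York'
in_state = (sample.Country == 'us') & (sample.Region == 'NY')

# Element-wise AND of the two masks, then take the first matching label.
row_labels = sample.index[is_named & in_state]
ny = row_labels[0]
print(ny)
print(sample.loc[ny])   # label-based access to that single row
```

The parentheses around each comparison are required because & binds more tightly than == in Python.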
Now, let's turn a column of the table (a Series object) into a pure NumPy array. We go from the Pandas world to NumPy (keeping in mind that Pandas is built on top of NumPy). We'll mostly work with the population count of all cities.
In [19]:
population = array(data.Population)
In [20]:
population.shape
Out[20]:
The population array is a one-dimensional vector with the populations of all cities (or NaN if the population is not available). The population of New York can be accessed in NumPy with basic indexing:
In [21]:
population[ny]
Out[21]:
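Indexing the NumPy array with a Pandas row label works here because the default index labels coincide with positions. The following sketch, on invented values, shows the conversion and what changes once a Series has been filtered.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0, 40.0])
arr = s.values                      # the underlying NumPy array
print(type(arr).__name__, arr.shape)
print(arr[2] == s[2])               # default index: label 2 is also position 2

filtered = s[s > 15]                # keeps the original labels 2 and 3
print(list(filtered.index))
print(filtered.values[0])           # position 0 of the new array ...
print(filtered[2])                  # ... holds the element labelled 2
```

So a label obtained from the full table is only a valid position in an array built from that same, unfiltered table.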
Let's find out how many cities actually have a population count. To do this, we'll select all elements in the population array that are not NaN. We can use the NumPy function isnan:
In [22]:
isnan(population)
Out[22]:
In [23]:
x = population[~_]
len(x), len(x) / float(len(population))
Out[23]:
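The cell above relies on the IPython variable _, which holds the last output. As a self-contained sketch on invented values, the same selection can be written with an explicitly named mask; it also shows why isnan is needed instead of a comparison with ==.

```python
import numpy as np

# Hypothetical population counts; NaN marks a missing value.
population = np.array([1000.0, np.nan, 250.0, np.nan, np.nan])

has_no_count = np.isnan(population)   # True where the value is NaN
x = population[~has_no_count]         # ~ flips the mask: keep the real counts
print(len(x), len(x) / float(len(population)))

# NaN never compares equal to anything, itself included,
# so an equality test cannot find the missing values.
print((population == np.nan).any())
```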
About 1.5% of all cities in this data set have a population count.
Now let's explore some statistics on the cities' populations.
In [24]:
x.mean()
Out[24]:
In [25]:
x.sum() / 1e9
Out[25]:
In [26]:
len(x)/float(len(population))
Out[26]:
The total population of those cities is about 2.3 billion people, about a third of the current world population. Hence, according to this data set, roughly 30% of the population lives in less than 1.5% of the cities in the world!
In [27]:
data.Population.describe()
Out[27]:
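The statistics above are computed on x, the array with the NaN values already removed. The sketch below, on invented values, illustrates why that filtering matters in NumPy and shows two alternatives: the NaN-aware functions nanmean and nansum, and Pandas, which skips NaN by itself.

```python
import numpy as np
import pandas as pd

# Hypothetical population counts; NaN marks a missing value.
population = np.array([1000.0, np.nan, 250.0, np.nan])

print(population.mean())                          # nan: reductions propagate NaN
print(population[~np.isnan(population)].mean())   # filter first, then reduce
print(np.nanmean(population), np.nansum(population))  # NaN-aware equivalents

summary = pd.Series(population).describe()        # Pandas ignores NaN here
print(summary)
```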
Now, let's find the city closest to a given pair of geographical coordinates.
In [28]:
locations = data[['Latitude','Longitude']].values
In [29]:
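def locate(lat, lon):
    # Offsets (in degrees) from every city to the requested point.
    d = locations - array([lat, lon])
    # Squared Euclidean distance; the square root is not needed for argmin.
    distances = d[:,0] ** 2 + d[:,1] ** 2
    closest = distances.argmin()
    return data.AccentCity[closest]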
In [30]:
print(locate(48.861, 2.3358))
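The locate function treats latitude and longitude as coordinates on a flat plane. That is a reasonable approximation for nearby points, but a degree of longitude shrinks toward the poles and longitudes wrap around at 180 degrees. As a sketch of one possible alternative (not the method used in this example), the nearest point can be found with the haversine great-circle formula. The function name locate_haversine and the sample coordinates are illustrative.

```python
import numpy as np


def locate_haversine(lat, lon, latitudes, longitudes):
    """Return the index of the point closest to (lat, lon) on a sphere.

    All angles are in degrees. Hypothetical helper, not part of the
    original notebook.
    """
    lat1, lon1 = np.radians(lat), np.radians(lon)
    lat2, lon2 = np.radians(latitudes), np.radians(longitudes)
    # Haversine formula: a is the squared half-chord length between points.
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    # Clip guards against rounding slightly outside [0, 1].
    central_angle = 2 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
    return central_angle.argmin()


# Hypothetical coordinates: rough positions of Paris, London and New York.
lats = np.array([48.86, 51.51, 40.71])
lons = np.array([2.35, -0.13, -74.01])
print(locate_haversine(48.861, 2.3358, lats, lons))
```

The returned index could then be used to look up the city name, in the same way closest is used in locate.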