In [198]:
# We will probably need these
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Real data is often stored in CSV files (Comma Separated Values).
So far, we have loaded CSV files with the np.loadtxt command. The loadtxt function provides basic functionality and works just fine, but for more elaborate data sets we want more sophisticated tools.
The most powerful and advanced package for data handling and analysis is called pandas
. We will use only a few functions of the pandas
package here. Full information on pandas
can be found on the pandas website.
Consider the following dataset, which is stored in the file transport.csv
. It shows the percentage of transportation kilometers by car, bus or rail for four countries. The dataset has four columns.
country, car, bus, rail
some more explanations, yada yada yada
France, 86.1, 5.3, 8.6
Germany, 85.2, 7.1, 7.7
Netherlands, 86.4, 4.6, 9
United Kingdom, 88.2, 6.5, 5.3
This data file is loaded with the read_csv
function of the pandas
package. The read_csv
function has many options. We will use three of them here. The rows that need to be skipped are defined with the skiprows
keyword (in this case row 1 with the yada yada
text). The skipinitialspace
keyword is set to True
so that the column name ' car' is loaded without the initial space that is in the data file. And the index_col
keyword is set to indicate that the names in column 0 can be used as an index to select a row.
In [2]:
from pandas import read_csv
tran = read_csv('transport.csv', skiprows=[1], skipinitialspace=True, index_col=0)
Pandas
loads data into a DataFrame
. A DataFrame
is like an array, but has many additional features that make it a very powerful datatype for data analysis. For starters, once you have loaded the data, you can print it to the screen
In [4]:
print(tran)
When the DataFrame is large, you can still print it to the screen (pandas
is smart enough not to show the entire DataFrame when it is very large), or you can simply print the first 5 lines of the DataFrame with the .head()
function.
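For example, a quick look at the first rows of the tran DataFrame (a minimal illustration; since tran has only four rows, head() simply returns all of them):
In [ ]:
print(tran.head())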
The read_csv
function returns what is called a DataFrame. The variable tran
in the code above is now a DataFrame. The rows and columns of a DataFrame may have names, as you can see from printing the DataFrame to the screen. To find out which names are used for the columns, use the keys
function, which is accessible with the dot syntax. You can loop through the names of the columns if you wish.
In [75]:
print('Names of columns:')
print(tran.keys())
print("\nAccessing rows of a Series, columns of a dataFrame or the major axis of a panel use keys()")
for k in tran.keys():
print(k)
print("\nAccessing columns in a dataFrame:")
for k in tran.columns:
print(k)
print("\nAccessing rows in a series or a dataFrame:")
for k in tran.index:
print(k)
In [9]:
tran.columns
Out[9]:
In [10]:
tran.index
Out[10]:
In [76]:
print("The name of the index of the dataFrame:")
tran.index.name
Out[76]:
The column headers may also be given a name:
In [77]:
tran.columns.name = 'transport'
tran
Out[77]:
Each DataFrame may be indexed just like an array, by specifying the row and column number using the .iloc
syntax (which stands for index location), where column 0 is the column labeled car
(since the column labeled as country
was stored as an index when reading the csv file; more on that later).
In [30]:
print("Item from first row and second column:\n", tran.iloc[0,1]) # Gives the bus data for France
print("\nItem from second row and first column:\n", tran.iloc[1,0]) # Gives the car data for Germany
print("\nItem from row 3 and column 3:\n", tran.iloc[2,2]) # Gives the rail data for Netherlands
We can also print entire rows or columns. But when a single row or a single column is selected from a dataframe, it is no longer a dataframe but a pandas series object:
In [38]:
print("\nRow 4, a pandas series:\n",tran.iloc[3]) # All data for United Kindom
print("\nThis is neither an array nor a dataframe but a {} object.".format(type(tran.iloc[3])))
print("\nColumn 2:\n", tran.iloc[:,1]) # All data for bus
print("\nThis too is neither an array nor a dataframe by a {} object.".format(type(tran.iloc[:,1])))
Rows and columns may also be selected by their names rather than by their position. One way is to use the .loc syntax instead of the .iloc syntax. This is a bit more typing but much more explicit. The equivalent of the code cell above, but using the names as indices, is
In [48]:
print("France, bus :", tran.loc['France','bus'])
print("\nGermany','car", tran.loc['Germany','car'])
print("\nNetherlands','rail", tran.loc['Netherlands','rail'])
print("\n",tran.loc['United Kingdom'])
print("\n",tran.loc[:,'bus'])
print("\nThe series object has received the name of the original row or column in the dataframe printed at the bottom.")
Because selecting by name is far more explicit than selecting by position, we will mostly use .loc rather than .iloc below.
There are two alternative ways to access an entire column. First, you can simply specify the column name as an index, without having to use the .loc syntax. Second, the dot syntax may be used, as in dataframe_name.column_name, where column_name is the name of the column. Hence, the following three are equivalent
In [55]:
print(tran.loc[:,'car']) # All rows of 'car' column
print('\n', tran['car']) # 'car' column
print('\n', tran.car) # also the car column, but accessed as an attribute of object tran
print('\n', tran.car['France'])
In [64]:
print(tran.loc["France"])
numpy functions for DataFrames
Series and DataFrames can pretty much be treated as arrays, especially when they contain numerical data. Most numpy functions work on DataFrames, and many of these functions can also be called as methods with the dot syntax, like dataframe_name.function(). Simply type tran. in a code cell and then hit the [tab] key to see all the functions that are available. In the code cell below, we compute the maximum value of transportation by car, the country corresponding to that maximum value, and the mean value of all transportation by car. You can also find all values larger than a specified value, just like for arrays:
In [82]:
print('Maximum car travel percentage: ', tran.car.max())
print('Country with maximum car travel percentage: ', tran.car.idxmax())
print('Mean car travel percentage: ', tran.car.mean())
print('\nAll rail travel above 8 percent:')
print(tran.rail[tran.rail > 8])
The file annual_precip.csv
contains the average yearly rainfall and total land area for all the countries in the world (well, there are some missing values); the data is available on the website of the World Bank. Open the data file to see what it looks like (Notepad, TextEdit, even Word if you have to). Load the data with the read_csv function of pandas, making sure that the names of the countries can be used to select a row, and perform the following tasks:
Print the first 5 lines of the dataset to the screen with the .head() function.
Print the average annual rainfall in Panama and the land area of the Netherlands.
Print all countries with an average annual rainfall below 200 mm/year, and all countries with an average annual rainfall above 2500 mm/year.
Print all countries with an average annual rainfall within 50 mm/year of that of the Netherlands.
In [85]:
from pandas import read_csv
rain = read_csv('annual_precip.csv', skiprows=2, index_col=0)
#
print('First five lines of rain dataset:')
print(rain.head())
#
print('\nAverage annual rainfall in Panama is',rain.loc['Panama','precip'],'mm/year')
#
print('\nLand area of the Netherlands is', rain.loc['Netherlands','area'], 'thousand km^2')
#
print('\nCountries where average rainfall is below 200 mm/year')
print(rain[ rain.precip < 200 ])
#
print('\nCountries where average rainfall is above 2500 mm/year')
print(rain[ rain.precip > 2500 ])
#
print('Countries with almost the same rainfall as Netherlands')
print(rain[abs(rain.loc['Netherlands','precip'] - rain.precip) < 50])
A column may be added to a DataFrame by simply specifying the name and values of the new column using the syntax DataFrame['newcolumn']=something
. For example, let's add a column named public_transport
, which is the sum of the bus
and rail
columns, and then find the country with the largest percentage of public transport
In [86]:
tran['public_transport'] = tran.bus + tran.rail
print('Country with largest percentage public transport:', tran.public_transport.idxmax())
You can plot the column or row of a DataFrame with the matplotlib
functions, as we have done in previous Notebooks, but pandas
has also implemented its own, much more convenient, plotting functions (still based on matplotlib
in the background, of course). The plotting capabilities of pandas
also use the dot syntax, like dataframe.plot()
. All columns can be plotted simultaneously (note that the names appear on the axes and the legend is added automatically!).
In [89]:
tran.plot(); # plot all columns
plt.show()
You can also plot one column at a time. The style of the plot may be specified with the kind
keyword (the default is 'line'
). Check out tran.plot?
for more options.
In [90]:
tran['bus'].plot(kind='bar')
plt.show()
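As one more small example of the kind keyword, the same kind of data may be drawn as a horizontal bar chart with kind='barh':
In [ ]:
tran['rail'].plot(kind='barh')  # horizontal bars, one per country
plt.xlabel('percentage of transportation by rail')
plt.show()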
DataFrames may be sorted with the .sort
function. The keyword inplace=True
replaces the values in the DataFrame with the new sorted values (when inplace=False
a new DataFrame is returned, which you can store in a separate variable so that you have two datasets, one sorted and one unsorted). The sort
function has several keyword arguments, including columns
which is either the name of one column to sort by or a list of columns so that data is sorted by the first specified column and when values are equal they are sorted by the next column in the list. Another keyword is ascending
, which you can use to specify whether to sort in ascending order (ascending=True
, which is the default), or descending order (ascending=False
)
In [91]:
print('Data sorted by car use:')
print(tran.sort(columns='car'))
print('Data sorted by bus use:')
print(tran.sort(columns='bus'))
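Note that in newer versions of pandas the .sort method has been replaced by .sort_values, where the column(s) to sort by are specified with the by keyword. A small sketch of the same sorts in that newer syntax, also showing how to sort by a list of columns and in descending order:
In [ ]:
# equivalent of tran.sort(columns='car') in newer pandas versions
print(tran.sort_values(by='car'))
# sort by bus use first, then by rail, with the largest values on top
print(tran.sort_values(by=['bus', 'rail'], ascending=False))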
Sometimes (quite often, really), the names of the columns in a dataset are not very convenient (long, including spaces, etc.). For the example of the transportation data, the columns have convenient names, but let's change them for demonstration purposes. You can rename columns in place, and you can change as many columns as you want. The old and new names are specified with a Python dictionary. A dictionary is a very useful data type. It is specified between braces {}, and links a word in the dictionary to a value. The value can be anything. You can then use the word in the dictionary as the index, just like you would look up a word in a paper dictionary.
In [92]:
firstdictionary = {'goals': 20, 'city': 'Delft'}
print(firstdictionary['goals'])
print(firstdictionary['city'])
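The values in a dictionary may themselves be lists (or any other objects), and you can loop through the words in a dictionary; a small additional sketch with made-up values:
In [ ]:
seconddictionary = {'scores': [3, 1, 2], 'city': 'Delft'}
for key in seconddictionary:
    print(key, '->', seconddictionary[key])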
Much more on Python dictionaries can be found, for example, here. Let's continue with renaming two of the columns of the tran
DataFrame:
In [93]:
tran.rename(columns={'bus': 'BUS',
'rail': 'train'}, inplace=True)
tran
Out[93]:
The index column, with the countries, is now called 'country'
, but we can rename that too, for example to 'somewhere in Europe'
, with the following syntax
In [94]:
tran.index.names = ['somewhere in Europe']
tran
Out[94]:
We'll read in the csv file worldPopulation2017.csv using pandas read_csv. The read_csv function has a large number of keyword options, most of which can generally be ignored so that their defaults are used. However, the current file has two header lines that need to be handled.
First look at the options; their meaning can be found on the internet:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=False, error_bad_lines=True, warn_bad_lines=True, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=False, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)
We have a two-line header, so we specify which lines comprise it: header=[0, 1]. This is about all we have to do to properly read in this entire file.
In [170]:
import pandas
worldPop = pandas.read_csv('worldPopulation2017.csv', header=[0, 1])
worldPop
Out[170]:
It works, but the double header line doesn't make us happy.
There are two issues:
1) We would rather have the country name as the row index.
2) We don't want this ugly double-line header above our columns.
Let's see what we can do about it using the options in read_csv.
One obvious thing to do is to tell read_csv to use the country column as the row index. This is done with the keyword index_col=.
The question is what value to use. "Country" does not work (try it), because a column cannot be selected by name when the header has multiple levels (here two); the error message you get tells you to use the column number instead. So we use index_col=1
In [172]:
import pandas
worldPop = pandas.read_csv('worldPopulation2017.csv', header=[0, 1], index_col=1)
worldPop
Out[172]:
This is better. But we still don't like the header above all the other columns.
It's easy to replace headings at any time with strings or other immutable objects of our liking.
Let's first see what these column headers look like.
In [173]:
worldPop.columns
Out[173]:
This is a list of tuples, which can be more easily seen by printing each tuple on its own line.
In [176]:
for p in worldPop.columns: print(p)
One idea is to join the strings from the two header lines. When that is done, we replace the first entry, which is not a tuple, with the string "Idx".
In [190]:
nwCols = [p[0] + p[1] if isinstance(p, tuple) else p for p in worldPop.columns]
print(nwCols)
In [191]:
nwCols[0] = "Idx"
for p in nwCols: print(p)
Nice, but some cleanup is still desired. We can do that with a few string methods placed in succession.
In [192]:
nc = [p.replace(' ','').replace('.','').replace('%','Perc') for p in nwCols]
print(nc)
In [194]:
worldPop.columns = nc
worldPop
Out[194]:
In [208]:
worldPop[["Migrants(net)", "FertRate"]] # we don't have to use .loc for entire columns
Out[208]:
The 'N.A.' entries are not interpreted as NaN but simply as strings. We could replace them, or read the worldPop file in once more, this time specifying that 'N.A.' should be interpreted as not available. The keyword for this is na_values='N.A.'
In [242]:
import pandas
worldPop = pandas.read_csv('worldPopulation2017.csv', header=[0, 1], index_col=1, na_values='N.A.')
# also replace the headers as was done before
nwCols = [p[0] + p[1] if isinstance(p, tuple) else p for p in worldPop.columns]
nwCols[0] = "Idx"
worldPop.columns = [p.replace(' ','').replace('.','').replace('%','Perc') for p in nwCols]
# and show it
worldPop
Out[242]:
Let's ask questions:
In [237]:
# Which countries have more people than the US?
worldPop.loc[worldPop["Population-2017"] > worldPop.loc["U.S.", "Population-2017"]]
Out[237]:
In [238]:
# Which countries are projected to have more people than the U.S. in 2050?
# (the fertility rate is used here, crudely, as an annual growth percentage)
p17 = "Population-2017"
p50 = "Population-2050"
worldPop[p50] = worldPop[p17] * (1 + worldPop["FertRate"]/100) ** (2050 - 2017)
In [239]:
worldPop
Out[239]:
In [241]:
worldPop.loc[worldPop[p50] >= worldPop.loc["U.S.", p50]]
Out[241]:
In time series data, one of the columns represents dates, sometimes including times, together referred to as datetimes. pandas
can be used to read csv files where one of the columns includes datetime data. You need to tell pandas
which column contains datetime values and pandas
will try to convert that column to datetime objects. Datetime objects are very convenient, as specifics of the datetime object may be accessed with the dot syntax: .year
returns the year, .month
returns the month, etc.
For example, consider the following data stored in the file timeseries1.dat
date, conc
2014-04-01, 0.19
2014-04-02, 0.23
2014-04-03, 0.32
2014-04-04, 0.29
The file may be read with read_csv
using the keyword parse_dates=[0]
so that column number 0 is converted to datetimes
In [96]:
data = read_csv('timeseries1.dat', parse_dates=[0])
print(data)
The rows of the DataFrame data
are numbered, as we have not told pandas
what column to use as the index of the rows. The first column of the DataFrame data
has datetime values. We can access, for example, the year, month, or day with the dot syntax
In [97]:
print('datetime of row 0:', data.iloc[0,0])
print('year of row 0:', data.iloc[0,0].year)
print('month of row 0:', data.iloc[0,0].month)
print('day of row 0:', data.iloc[0,0].day)
Time series data may also contain the time in addition to the date. For example, the data of the file timeseries2.dat
, shown below, contains the day and time. You can access the hour or minute attributes, and also the time of a row of the DataFrame with the .time() function.
date, conc
2014-04-01 12:00:00, 0.19
2014-04-01 13:00:00, 0.20
2014-04-01 14:00:00, 0.23
2014-04-01 15:00:00, 0.21
In [98]:
data = read_csv('timeseries2.dat', parse_dates=[0])
print(data)
print('hour of row 0:', data.iloc[0,0].hour)
print('minute of row 0:', data.iloc[0,0].minute)
print('time of row 0:', data.iloc[0,0].time())
Rainfall data for the Netherlands may be obtained from the website of the Royal Netherlands Meteorological Institute (KNMI). Daily rainfall for the weather station Rotterdam in 2012 is stored in the file rotterdam_rainfall_2012.txt
. First open the file in a text editor to see what the file looks like. At the top of the file, an explanation is given of the data in the file. Read this. Load the data file with the read_csv
function of pandas
. Use the keyword skiprows
to skip all rows except for the row with the names of the columns. Use the keyword parse_dates
to give either the name or number of the column that needs to be converted to a datetime. Don't forget the skipinitialspace
keyword, else the names of the columns may start with a bunch of spaces. Perform the following tasks:
Convert the rainfall to mm/day and set any negative values (rainfall below the measurement accuracy) to zero.
Use the plot function of pandas to create a line plot of the daily rainfall with the number of the day (so not the date) along the horizontal axis.
Use matplotlib functions to add labels to the axes and to set the limits along the horizontal axis from 0 to 365.
Determine the maximum daily rainfall and the date on which it occurred.
In [99]:
rain = read_csv('rotterdam_rainfall_2012.txt',skiprows=9,
parse_dates=['YYYYMMDD'],skipinitialspace=True)
rain.iloc[:,2] = rain.iloc[:,2] * 0.1  # convert rainfall from 0.1 mm/day to mm/day
rain.RH[rain.RH<0] = 0  # set values below measurement accuracy to zero
rain.RH.plot()
plt.xlabel('day')
plt.ylabel('daily rainfall (mm/day)')
plt.xlim(0,365)
print('Maximum daily rainfall',rain.RH.max())
print('Date of maximum daily rainfall',rain.YYYYMMDD[rain.RH.argmax()])
In this exercise we are going to compute the total monthly rainfall for 2012 in the City of Rotterdam using the daily rainfall measurements we loaded in the previous Exercise; later on in this Notebook we learn convenient functions from pandas
to do this, but here we are going to do this with a loop. Create an array of 12 zeros to store the monthly totals and loop through all the days in 2012 to compute the total rainfall for each month. The month associated with each row of the DataFrame may be obtained with the .month
syntax, as shown above. Print the monthly totals (in mm/month) to the screen and create a bar graph of the total monthly rainfall (in mm/month) vs. the month using the plt.bar
function of matplotlib.
In [101]:
monthlyrain = np.zeros(12)
for i in range(len(rain)):
    month = rain.iloc[i,1].month
    monthlyrain[month-1] += rain.iloc[i,2]
print(monthlyrain)
#
plt.bar(np.arange(12),monthlyrain,width=1)
plt.xlabel('month')
plt.ylabel('monthly rainfall (mm/month)')
plt.xticks(np.arange(12)+0.5,['J','F','M','A','M','J','J','A','S','O','N','D'])
plt.xlim(0,12)
Out[101]:
When the datetime column is specified as the index of the DataFrame (with the index_col keyword), rows may be selected directly by their date:
In [102]:
data = read_csv('timeseries1.dat', parse_dates=[0], index_col=0)
print(data)
print('data on April 1:',data.loc['2014-04-01'])
print('data on April 2:',data.loc['2014-04-02'])
DataFrames have a very powerful feature called resampling. Downsampling refers to going from high frequency to low frequency. For example, going from daily data to monthly data. Upsampling refers to going from low frequency to high frequency. For example going from monthly data to daily data. For both upsampling and downsampling, you need to tell pandas
how to perform the resampling. Here we discuss downsampling, where we compute monthly totals from daily values. First we load the daily rainfall in Rotterdam in 2012 from the file rotterdam_rainfall_2012.txt
and specify the dates as the index (this is the column labeled as YYYYMMDD
). We resample the rain to monthly totals using the resample
function. You have to tell the resample
function to what frequency it needs to resample. Common ones are 'A'
for yearly, 'M'
for monthly, 'W'
for weekly, 'D'
for daily, and 'H'
for hourly, but there are many other ones (see here). The keyword argument how
is used to tell pandas
how to compute the resampled data. This can be many things, like 'mean'
for the mean (that is the default), 'sum'
for the total, 'min'
, 'max'
, etc. The keyword argument kind
is used to tell pandas
where to assign the computed value to. You can assign the computed value to the last day of the period, or the first day, or to the entire period (in this case the entire month). The latter is done by specifying kind='period'
, which is what we will do here. Calculating the monthly totals and making a bar graph can now be done with pandas
as follows.
In [103]:
rain = read_csv('rotterdam_rainfall_2012.txt', skiprows=9,
parse_dates=['YYYYMMDD'], index_col='YYYYMMDD',
skipinitialspace=True)
rain.RH[rain.RH<0] = 0
rain.RH = rain.RH * 0.1 # Convert to mm/day
monthlyrain = rain.RH.resample('M', how='sum', kind='period')
print(monthlyrain)
monthlyrain.plot(kind='bar')
plt.ylabel('mm/month')
plt.xlabel('month')
Out[103]:
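The other frequencies and aggregation methods mentioned above work in the same way. A small sketch using the same rain DataFrame (note that newer pandas versions write the aggregation as a method, e.g. rain.RH.resample('W').max(), instead of using the how keyword):
In [ ]:
# weekly maximum of the daily rainfall, assigned to weekly periods
weeklymax = rain.RH.resample('W', how='max', kind='period')
print(weeklymax.head())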
The file rotterdam_weather_2000_2010.txt
contains daily weather data at the weather station Rotterdam for the period 2000-2010 (again from the KNMI). Open the data file in an editor to see what is in it. Perform the following tasks:
Load the data with read_csv, using the dates as the index of the DataFrame, and convert the mean temperature, rainfall, and evaporation columns to degrees Celsius and mm/day; set any negative rainfall values to zero.
Compute the yearly total rainfall, the yearly total evaporation, and the yearly mean temperature with the resample function.
Plot the three yearly series in one graph with the plot function of pandas. Make sure to plot the mean temperature on the secondary $y$-axis (use the help function to find out how), and add labels along the axes.
In [104]:
weather = read_csv('rotterdam_weather_2000_2010.txt',skiprows=11,
parse_dates=['YYYYMMDD'],index_col='YYYYMMDD',skipinitialspace=True)
weather.TG = 0.1 * weather.TG      # mean temperature in degrees Celsius
weather.RH = 0.1 * weather.RH      # rainfall in mm/day
weather.EV24 = 0.1 * weather.EV24  # reference evaporation in mm/day
weather.RH[weather.RH<0] = 0
yearly_rain = weather.RH.resample('A', how='sum', kind='period')
yearly_evap = weather.EV24.resample('A', how='sum', kind='period')
yearly_temp = weather.TG.resample('A', how='mean', kind='period')
yearly_rain.plot()
yearly_evap.plot()
yearly_temp.plot(secondary_y=True)
plt.xlabel('Year')
plt.ylabel('Rain/evap (mm/year)');