After taking a look at WDI_Data.csv, it seems many of the indicators are missing significant amounts of data. Let's go through the data set and count how many non-empty entries each indicator has.


In [10]:
import csv

# maps indicator code -> number of non-empty value cells
count_dict = {}

# 'rb' is the Python 2 csv idiom; on Python 3 use open('WDI_Data.csv', newline='')
with open('WDI_Data.csv', 'rb') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    # skip the header row
    next(csv_reader)
    for row in csv_reader:
        # indicator code for this row (column index 3)
        ind_name = row[3]
        # everything after column index 3 is a data value; count the non-empty ones
        non_empty = sum(1 for item in row[4:] if item != '')
        # indicators with no data at all are left out of count_dict
        if non_empty:
            count_dict[ind_name] = count_dict.get(ind_name, 0) + non_empty

We've counted the non-empty entries for each indicator and stored the totals in count_dict. Let's sort the dictionary's items by count and print the top 25.


In [15]:
import operator
sorted_dict = sorted(count_dict.items(), key=operator.itemgetter(1))
# descending order would be nicer
sorted_dict.reverse()
# print the top 25
for entry in sorted_dict[:25]:
    # each entry is an (indicator code, count) tuple
    print(entry)


('SP.POP.TOTL', 14623)
('SP.POP.GROW', 14569)
('SP.RUR.TOTL.ZS', 14552)
('SP.URB.TOTL.IN.ZS', 14552)
('SP.URB.TOTL', 14511)
('SP.RUR.TOTL', 14511)
('SP.URB.GROW', 14461)
('AG.SRF.TOTL.K2', 14185)
('AG.LND.TOTL.K2', 14180)
('SP.RUR.TOTL.ZG', 14122)
('EN.POP.DNST', 14111)
('SP.DYN.CBRT.IN', 13480)
('SP.ADO.TFRT', 13440)
('SP.DYN.CDRT.IN', 13440)
('SP.POP.1564.TO.ZS', 13378)
('SP.POP.TOTL.FE.ZS', 13378)
('SP.POP.0014.TO.ZS', 13378)
('SP.POP.65UP.TO.ZS', 13378)
('SP.POP.DPND.YG', 13375)
('SP.POP.DPND.OL', 13375)
('SP.POP.DPND', 13375)
('SP.DYN.TFRT.IN', 13274)
('SP.DYN.LE00.FE.IN', 13253)
('SP.DYN.LE00.MA.IN', 13253)
('SP.DYN.LE00.IN', 13253)
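
The count-and-rank steps above can also be sketched with `collections.Counter`, whose `most_common` method returns the highest counts first. This is only an illustrative alternative, not the code that produced the output above. The helper name `count_non_empty`, its `code_col` parameter, and the sample rows are assumptions made up for the example; only the indicator codes come from the listing.

```python
from collections import Counter

def count_non_empty(rows, code_col=3):
    """Count non-empty value cells per indicator code.

    rows: iterable of row lists with the header already skipped.
    code_col: index of the indicator-code column (3 in the loop above).
    """
    counts = Counter()
    for row in rows:
        filled = sum(1 for item in row[code_col + 1:] if item != '')
        if filled:
            counts[row[code_col]] += filled
    return counts

# Made-up sample rows, only to show the shape of the data.
sample_rows = [
    ['Country A', 'AAA', 'Population, total', 'SP.POP.TOTL', '10', '11', '12'],
    ['Country B', 'BBB', 'Population, total', 'SP.POP.TOTL', '', '20', '21'],
    ['Country A', 'AAA', 'Population growth', 'SP.POP.GROW', '', '', '1.5'],
    ['Country B', 'BBB', 'Land area', 'AG.LND.TOTL.K2', '', '', ''],
]

counts = count_non_empty(sample_rows)
top = counts.most_common(2)
for entry in top:
    print(entry)
```

On the real file, the same helper could be fed the `csv_reader` from the first cell once the header row has been skipped, and `counts.most_common(25)` would stand in for the sort, reverse, and slice.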

These indicators are very densely populated in our dataset: each of the top 25 has more than 13,000 non-empty entries. It would be interesting to use some of them to build a model that predicts population growth rate (SP.POP.GROW).
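
Raw counts depend on how many rows and value columns the file has, so a fill fraction (non-empty cells divided by all value cells) may make "densely populated" easier to judge. The sketch below is one possible way to compute it. The helper name `fill_fraction`, its `code_col` parameter, and the sample rows are hypothetical and not part of the analysis above.

```python
def fill_fraction(rows, code_col=3):
    """Return {indicator code: share of value cells that are non-empty}."""
    filled = {}
    total = {}
    for row in rows:
        code = row[code_col]
        values = row[code_col + 1:]
        total[code] = total.get(code, 0) + len(values)
        filled[code] = filled.get(code, 0) + sum(1 for v in values if v != '')
    # float() keeps the division exact under Python 2 as well
    return {code: filled[code] / float(total[code])
            for code in total if total[code]}

# Made-up sample rows, only to show the shape of the data.
sample_rows = [
    ['Country A', 'AAA', 'Population, total', 'SP.POP.TOTL', '10', '11', '12'],
    ['Country B', 'BBB', 'Population, total', 'SP.POP.TOTL', '', '20', '21'],
    ['Country A', 'AAA', 'Population growth', 'SP.POP.GROW', '', '', '1.5'],
    ['Country B', 'BBB', 'Land area', 'AG.LND.TOTL.K2', '', '', ''],
]

fractions = fill_fraction(sample_rows)
for code in sorted(fractions, key=fractions.get, reverse=True):
    print((code, round(fractions[code], 3)))
```

Unlike the raw counts, this version also reports indicators with no data at all (fraction 0.0), which could help when deciding which indicators to leave out of a model.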

