A look at WDI_Data.csv suggests that many of the indicators are missing significant amounts of data. Let's go through the data set and count how many non-empty entries each indicator has.
In [10]:
import csv
count_dict = {}
with open('WDI_Data.csv', 'rb') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    idx = 0
    for row in csv_reader:
        # skip first row
        if (idx == 0):
            idx += 1
            continue
        # indicator name for this row
        ind_name = row[3]
        row_idx = 0 # where we are at within row
        for item in row:
            if row_idx > 3 and item != '':
                # this is a valid, non-empty entry
                if ind_name not in count_dict:
                    count_dict[ind_name] = 0
                count_dict[ind_name] += 1
            row_idx += 1
        idx += 1
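The same tally can probably be written more compactly with `collections.Counter`. The sketch below is written for Python 3 (the cell above uses Python 2 conventions such as opening the file in 'rb' mode). The function name `count_non_empty`, its parameters, and the tiny in-memory sample are made up for illustration; the sample only mimics the column positions the cell above relies on (label at index 3, data from index 4 onward) and is not real WDI data.

```python
import csv
import io
from collections import Counter

def count_non_empty(rows, name_col=3, first_data_col=4):
    """Tally non-empty data cells per indicator.

    rows: iterable of lists (for example a csv.reader), header row included.
    name_col and first_data_col mirror the indices used in the cell above.
    """
    counts = Counter()
    rows = iter(rows)
    next(rows, None)  # skip the header row
    for row in rows:
        if len(row) <= name_col:
            continue  # skip blank or truncated lines
        # Unlike the loop above, an indicator with no data still gets a key (count 0)
        counts[row[name_col]] += sum(1 for item in row[first_data_col:] if item != '')
    return counts

# Hypothetical stand-in for WDI_Data.csv; headers and values are illustrative only
sample = io.StringIO(
    "col0,col1,col2,indicator,1960,1961,1962\n"
    "Aland,ALD,x,SP.POP.GROW,1.2,,1.4\n"
    "Borduria,BRD,x,SP.POP.GROW,,,0.9\n"
    "Aland,ALD,x,XX.SPARSE,,,\n"
)
sample_counts = count_non_empty(csv.reader(sample))
print(sample_counts)
```

On the real file under Python 3, the reader would be created from `open('WDI_Data.csv', newline='')` and passed to the function in the same way. Keeping zero-count indicators in the result is a deliberate difference from the loop above, since it makes completely empty indicators visible.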
We've counted the number of entries for each indicator and stored them in count_dict. Let's sort the dictionary by value and print the top 25 entries.
In [15]:
import operator
sorted_dict = sorted(count_dict.items(), key=operator.itemgetter(1))
# descending order would be nicer
sorted_dict.reverse()
# print the top 25
idx = 0
for key in sorted_dict:
    if idx > 24:
        break
    print key
    idx += 1
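The sort, reverse, and counter-based loop can be collapsed into a single descending sort followed by a slice. This is a minimal sketch in Python 3; `top_n` and `example_counts` are hypothetical names with made-up values, standing in for count_dict.

```python
from operator import itemgetter

def top_n(counts, n=25):
    """Return the n (indicator, count) pairs with the largest counts, descending."""
    ranked = sorted(counts.items(), key=itemgetter(1), reverse=True)
    return ranked[:n]

# Made-up counts standing in for count_dict
example_counts = {'A': 5, 'B': 12, 'C': 0, 'D': 7}
for name, count in top_n(example_counts, n=3):
    print(name, count)
```

One small difference to be aware of: `reverse=True` keeps tied entries in their original relative order, whereas sorting ascending and then calling `reverse()` flips them, so indicators with equal counts may appear in a different order.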
These indicators seem to be very densely populated in our dataset. It would be interesting to use some of these indicators and build a model to predict population growth rate.