Cleaning of data


Cleaning of data with Pandas

In this notebook I'm giving some examples on how you can clean your data using Pandas

Pandas allows you to clean your data and create interesting views and statistics on it.

Some setup:


In [3]:
import pandas as pd
import numpy as np
doc = pd.read_excel('/home/rick/Downloads/ADS-GC/Portfolio/Original Assignments/GRAIN---Land-grab-deals---Jan-2012.xls')

doc[:3]


Out[3]:
Landgrabbed Landgrabber Base Sector Hectares Production Projected investment Status of deal Summary
0 Algeria Al Qudra UAE Finance, real estate 31000.0 Milk, olive oil, potatoes NaN Done Al Qudra Holding is a joint-stock company esta...
1 Angola CAMC Engineering Co. Ltd China Construction 1500.0 Rice US$77 million Done CAMCE is a subsidiary of the China National Ma...
2 Angola ENI Italy Energy 12000.0 Oil palm NaN In process The project is a joint venture between Sonango...

Mapping the statusus to valid values.

There are some values in this dataset that are not consitent.

This code will fix it to some known good values


In [4]:
# fix Status of deal
valid_statuses = {'done': 'done', 
                  'suspended': 'suspended', 
                  'proposed': 'proposed', 
                  'in process': 'in process',
                  'signed': 'in process'}


def fix_status_of_deal(field: str):
    for value, key in valid_statuses.items():
        if value in field.lower():
            return key
    return field.strip()


doc['Status of deal'] = doc['Status of deal'].map(fix_status_of_deal)
doc['Status of deal'].unique()


Out[4]:
array(['done', 'in process', 'suspended', 'proposed'], dtype=object)

Searching for more issues

Now we are checking for missing data points.

For some columns this will not realy matter, for others it does.


In [9]:
import pandas as pd
for column in doc.columns:
    col = doc[column]  # type: pd.Series
    print('%s: %s' % (column, col.isnull().sum()))
    # could go through the effort to replace it with 'Missing' but that is actually less useful then NaN/null


Landgrabbed: 0
Landgrabber: 0
Base: 0
Sector: 10
Hectares: 2
Production: 34
Projected investment: 310
Status of deal: 0
Summary: 0

Fix the numeric values to actually be numeric

The numeric values are filled it by had. And that causes some to be in different formats.

This code detects the format and parses to a numeric value.

In the end, there are still some values that are not parsable, and need to be fixed by hand.


In [10]:
# Make Project investment numeric
import re


re_avg = re.compile(r'(\d+)-(\d+)(E\d+)')

def fixnumb(inp: str):
    if isinstance(inp, float):
        return inp
    if not inp or not inp.strip():
        return ''
    x = inp.upper().replace('US$', '').replace(' ', '').replace(',', '.')
    x = x.replace('BILLION', 'E9').replace('MILLION', 'E6')
    try:
        return float(x)
    except ValueError as e:
        if re_avg.match(x):
            res = re_avg.search(x)
            left, right, sin = res.groups()
            left = float(left)
            right = float(right)
            avg = (left+right)/2
            try:
                return float(repr(avg) + sin)
            except ValueError as e:
                print('x: %r, a: %r, %s %s' % (x, avg, e, inp))
                return inp
        print('x: %r, %s %s' % (x, e, inp))
        return inp

doc['Projected investment'] = doc['Projected investment'].map(fixnumb)
doc[:3]


x: '8/HA/YR(LEASE)', could not convert string to float: '8/HA/YR(LEASE)' US$8/ha/yr (lease)
x: '4/HA/YR(LEASE)', could not convert string to float: '4/HA/YR(LEASE)' US$4/ha/yr (lease)
x: '1.2/HA/YR(AFTERFIRST7YEARS)INGAMBELAAND8/HA/YR(AFTERFIRST6YEARS)INBAKO', could not convert string to float: '1.2/HA/YR(AFTERFIRST7YEARS)INGAMBELAAND8/HA/YR(AFTERFIRST6YEARS)INBAKO' US$1.2/ha/yr (after first 7 years) in Gambela and US$8/ha/yr (after first 6 years) in Bako
x: '4E6(LEASECOSTFOR25.000HA)', could not convert string to float: '4E6(LEASECOSTFOR25.000HA)' US$4 million (lease cost for 25,000 ha)
x: '4/HA/YR(LEASE)', could not convert string to float: '4/HA/YR(LEASE)' US$4/ha/yr (lease)
x: '57.600(0.80/HA)', could not convert string to float: '57.600(0.80/HA)' US$57,600 (US$0.80/ha)
x: '205E6(HALFOFFUND)', could not convert string to float: '205E6(HALFOFFUND)' US$205 million (half of fund)
x: '205E6(HALFOFFUND)', could not convert string to float: '205E6(HALFOFFUND)' US$205 million (half of fund)
x: '125.000/YR(LANDLEASE)', could not convert string to float: '125.000/YR(LANDLEASE)' US$125,000/yr (land lease)
Out[10]:
Landgrabbed Landgrabber Base Sector Hectares Production Projected investment Status of deal Summary
0 Algeria Al Qudra UAE Finance, real estate 31000.0 Milk, olive oil, potatoes NaN done Al Qudra Holding is a joint-stock company esta...
1 Angola CAMC Engineering Co. Ltd China Construction 1500.0 Rice 7.7e+07 done CAMCE is a subsidiary of the China National Ma...
2 Angola ENI Italy Energy 12000.0 Oil palm NaN in process The project is a joint venture between Sonango...

Fix the production column

There are a couple of production values in different formats.

This code splits it using a regex, and uses difflib to automatically match it to the closest known value.

When it doesn't know the product, it will add it to the known values.


In [11]:
import difflib


re_split = re.compile(r'(?:,|&|;|and|\n|\([^)]+\))')
options = []


def fix_production(x: str):
    # already parsed
    if isinstance(x, list):
        return x
    
    # empty, integer, float, etc.
    if type(x) != str:
        return []
    
    # Split the text into words, ignoring 'and' and inside braces
    x = [y.strip() for y in re_split.split(x.lower()) if y.strip()]
    y = []
    for part in x:
        # Check if we already know a similar word, if not add it otherwise use the known word
        matches = difflib.get_close_matches(part, options, n=1)
        if not matches:
            options.append(part)
            y.append(part)
        else:
            y.append(matches[0])
    return y
    

doc['Production'] = doc['Production'].map(fix_production)
doc[:10]


Out[11]:
Landgrabbed Landgrabber Base Sector Hectares Production Projected investment Status of deal Summary
0 Algeria Al Qudra UAE Finance, real estate 31000.0 [milk, olive oil, potatoes] NaN done Al Qudra Holding is a joint-stock company esta...
1 Angola CAMC Engineering Co. Ltd China Construction 1500.0 [rice] 7.7e+07 done CAMCE is a subsidiary of the China National Ma...
2 Angola ENI Italy Energy 12000.0 [oil palm] NaN in process The project is a joint venture between Sonango...
3 Angola AfriAgro Portugal Finance, real estate 5000.0 [oil palm] 3.25e+07 done AfriAgro is a subsidiary of the Portugal-based...
4 Angola Eurico Ferreira Portugal Energy, telecommunications\n 30000.0 [sugar cane] 2e+08 done In 2008, Portuguese conglomerate Eurico Ferrei...
5 Angola Quifel Natural Resources Portugal Agribusiness, energy 10000.0 [oilseed] NaN done Quifel Natural Resources is part of Portugal's...
6 Angola Lonrho UK Agribusiness 25000.0 [rice] NaN done In 2005, all that remained of Lonrho, once one...
7 Argentina Grupo Maggi Brazil Agribusiness 7000.0 [soybeans] NaN done Grupo Maggi, controlled by Blairo Maggi, one o...
8 Argentina Beidahuang China Agribusiness 320000.0 [maize, soybeans, wheat] 1.5e+06 suspended State-owned Beidahuang is the largest farming ...
9 Argentina Ingleby Company Denmark Finance 12433.0 [barley, maize, soybeans, sunflower, wheat] NaN done The Ingleby Company, which is owned by the Rau...

Pivot Table

A pivot table is a realy good way to visualise different values.

In this table you can see the hectares per land grabber per country.


In [5]:
pd.pivot_table(doc, values=['Hectares'], index=['Base', 'Landgrabber'])


Out[5]:
Hectares
Base Landgrabber
Argentina Cresud 111333.333333
El Tejar 190000.000000
Hillock Capital Management 9000.000000
Ingacot Group 1000.000000
Los Grobo 52766.666667
Australia BKK Partners 100000.000000
The Trust Company Limited 13691.000000
Bahrain Hassan Group 10000.000000
Bangladesh Bangladesh 20200.000000
Bhati Bangla Agrotec 30000.000000
Nitol-Niloy Group 10000.000000
Belgium FELISA 4258.000000
SIAT 107300.000000
Bermuda NFD Agro 34300.000000
Brazil Brazil Agro Business Group 5000.000000
Grupo Maggi 7000.000000
JBS 1876.000000
Monica Semillas 13000.000000
Petro Buzi 40000.000000
Pinosso Group 100000.000000
Vale-Embrapa 30000.000000
Brunei Brunei Investment Authority 10000.000000
Bulgaria Ceres 21400.000000
Canada Alberta Investment Management Company 252000.000000
Brookfield Asset Management 97124.000000
Canadian Economic Development Assistance for Southern Sudan (CEDASS) 12200.000000
Feronia Inc. 110000.000000
Hancock 47715.000000
SeedRock Africa Agriculture 40000.000000
Cayman Islands Nagathom Fund 2200.000000
... ... ...
US Aslan Global Management 20166.666667
BDFC Ethiopia 17400.000000
Black River Asset Management 70000.000000
Bunge 10000.000000
Bunge 25000.000000
CAMS Group 20000.000000
Dominion Farms 18000.000000
Elana Agricultural Land Opportunity Fund 29320.000000
Galtere 25000.000000
Grain Alliance 40000.000000
Harvard Management Company 1760.000000
Herakles Capital\n 38682.000000
Jarch Management 400000.000000
Jim Rogers Fund 80000.000000
Kyiv-Atlantic Ukraine 10000.000000
Maple Energy 13500.000000
Millennium Challenge Corporation 22441.000000
NCH Capital 350000.000000
Nile Trading and Development Inc. 600000.000000
Sollus Capital 35000.000000
Southern Global Inc. 30000.000000
TM Plantations 50000.000000
Teachers Insurance and Annuity Association - College Retirement Equities Fund (TIAA-CREF) 248500.000000
Tiba Agro 320000.000000
World Bank 29409.000000
US Black River Asset Management 2100.000000
Vietnam Long Van 28 Company 200000.000000
Vietnam Africa Agricultural Development Company\n 10000.000000
Vietnamese investors 4000.000000
West Africa UEMOA 11288.000000

312 rows × 1 columns