Cleaning data with Pandas

This notebook gives some examples of how you can clean your data using Pandas.

Pandas allows you to clean your data and create interesting views and statistics on it.

Some setup:



In [3]:

    
import pandas as pd
import numpy as np
doc = pd.read_excel('/home/rick/Downloads/ADS-GC/Portfolio/Original Assignments/GRAIN---Land-grab-deals---Jan-2012.xls')

doc[:3]









    Out[3]:







  
    
      
	Landgrabbed	Landgrabber	Base	Sector	Hectares	Production	Projected investment	Status of deal	Summary
0	Algeria	Al Qudra	UAE	Finance, real estate	31000.0	Milk, olive oil, potatoes	NaN	Done	Al Qudra Holding is a joint-stock company esta...
1	Angola	CAMC Engineering Co. Ltd	China	Construction	1500.0	Rice	US$77 million	Done	CAMCE is a subsidiary of the China National Ma...
2	Angola	ENI	Italy	Energy	12000.0	Oil palm	NaN	In process	The project is a joint venture between Sonango...

Mapping the statuses to valid values

Some values in this dataset are not consistent.

This code maps them to a set of known good values.



In [4]:

    
# fix Status of deal
valid_statuses = {'done': 'done', 
                  'suspended': 'suspended', 
                  'proposed': 'proposed', 
                  'in process': 'in process',
                  'signed': 'in process'}


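def fix_status_of_deal(field: str):
    # Return the canonical status whose keyword occurs in the raw value.
    lowered = field.lower()
    for keyword, status in valid_statuses.items():
        if keyword in lowered:
            return status
    # No known keyword found: keep the value, minus surrounding whitespace.
    return field.strip()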


doc['Status of deal'] = doc['Status of deal'].map(fix_status_of_deal)
doc['Status of deal'].unique()









    Out[4]:





array(['done', 'in process', 'suspended', 'proposed'], dtype=object)
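
To make the substring matching above concrete, here is a minimal sketch on a handful of made-up status strings. The sample values are hypothetical and not taken from the dataset, and the extra `isinstance` guard is an assumption added so that missing values would pass through. The last two lines show one possible way to list values the mapping did not catch.

```python
import pandas as pd

# Keyword to look for -> canonical status (same idea as the cell above).
valid_statuses = {
    'done': 'done',
    'suspended': 'suspended',
    'proposed': 'proposed',
    'in process': 'in process',
    'signed': 'in process',
}


def fix_status_of_deal(field):
    # Assumption: non-string values such as NaN are passed through untouched.
    if not isinstance(field, str):
        return field
    lowered = field.lower()
    for keyword, status in valid_statuses.items():
        if keyword in lowered:
            return status
    return field.strip()


# Hypothetical raw values, made up for illustration only.
sample = pd.Series(['Done', 'Done (50-year lease)', 'MoU signed', ' Unclear '])
cleaned = sample.map(fix_status_of_deal)
print(cleaned.tolist())

# Values that are still not one of the known good statuses.
unmapped = cleaned[~cleaned.isin(set(valid_statuses.values()))]
print(unmapped.tolist())
```

Any value that shows up in `unmapped` is a candidate for a new keyword in the mapping.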

Searching for more issues

Now we check for missing data points.

For some columns this does not really matter; for others it does.



In [9]:

    
import pandas as pd
for column in doc.columns:
    col = doc[column]  # type: pd.Series
    print('%s: %s' % (column, col.isnull().sum()))
    # We could replace missing values with 'Missing', but that is actually less useful than NaN/null









    



Landgrabbed: 0
Landgrabber: 0
Base: 0
Sector: 10
Hectares: 2
Production: 34
Projected investment: 310
Status of deal: 0
Summary: 0
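
The same per-column count can also be obtained without an explicit loop. The sketch below uses a tiny made-up frame (the values are hypothetical, only the column names follow the dataset) and additionally computes the share of missing rows per column, which can help decide which columns matter.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset, made up for illustration.
toy = pd.DataFrame({
    'Landgrabbed': ['Algeria', 'Angola', 'Angola'],
    'Hectares': [31000.0, np.nan, 12000.0],
    'Projected investment': [np.nan, 'US$77 million', np.nan],
})

# Number of missing cells per column.
missing = toy.isnull().sum()
# Fraction of rows that are missing per column, rounded to two decimals.
share = toy.isnull().mean().round(2)

print(missing)
print(share)
```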

Fix the numeric values to actually be numeric

The numeric values were filled in by hand, which causes some of them to be in different formats.

This code detects the format and parses it to a numeric value.

In the end, there are still some values that cannot be parsed and need to be fixed by hand.



In [10]:

    
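# Make 'Projected investment' numeric
import re


# Matches a range with a shared magnitude, such as '30-35E6'.
re_avg = re.compile(r'(\d+)-(\d+)(E\d+)')

def fixnumb(inp: str):
    # Numbers (including NaN for missing values) pass through unchanged.
    if isinstance(inp, (int, float)):
        return inp
    if not inp or not inp.strip():
        return ''
    # Normalise: drop the currency and spaces, turn magnitude words into exponents.
    # Caveat: ',' is read as a decimal separator, so a thousands separator
    # (as in 'US$57,600') ends up as a decimal point.
    x = inp.upper().replace('US$', '').replace(' ', '').replace(',', '.')
    x = x.replace('BILLION', 'E9').replace('MILLION', 'E6')
    try:
        return float(x)
    except ValueError as e:
        res = re_avg.match(x)
        if res:
            # A range is replaced by the average of its two bounds.
            left, right, sin = res.groups()
            avg = (float(left) + float(right)) / 2
            try:
                return float(repr(avg) + sin)
            except ValueError as err:
                print('x: %r, a: %r, %s %s' % (x, avg, err, inp))
                return inp
        # Anything else is reported and left as text for manual fixing.
        print('x: %r, %s %s' % (x, e, inp))
        return inp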

doc['Projected investment'] = doc['Projected investment'].map(fixnumb)
doc[:3]









    



x: '8/HA/YR(LEASE)', could not convert string to float: '8/HA/YR(LEASE)' US$8/ha/yr (lease)
x: '4/HA/YR(LEASE)', could not convert string to float: '4/HA/YR(LEASE)' US$4/ha/yr (lease)
x: '1.2/HA/YR(AFTERFIRST7YEARS)INGAMBELAAND8/HA/YR(AFTERFIRST6YEARS)INBAKO', could not convert string to float: '1.2/HA/YR(AFTERFIRST7YEARS)INGAMBELAAND8/HA/YR(AFTERFIRST6YEARS)INBAKO' US$1.2/ha/yr (after first 7 years) in Gambela and US$8/ha/yr (after first 6 years) in Bako
x: '4E6(LEASECOSTFOR25.000HA)', could not convert string to float: '4E6(LEASECOSTFOR25.000HA)' US$4 million (lease cost for 25,000 ha)
x: '4/HA/YR(LEASE)', could not convert string to float: '4/HA/YR(LEASE)' US$4/ha/yr (lease)
x: '57.600(0.80/HA)', could not convert string to float: '57.600(0.80/HA)' US$57,600 (US$0.80/ha)
x: '205E6(HALFOFFUND)', could not convert string to float: '205E6(HALFOFFUND)' US$205 million (half of fund)
x: '205E6(HALFOFFUND)', could not convert string to float: '205E6(HALFOFFUND)' US$205 million (half of fund)
x: '125.000/YR(LANDLEASE)', could not convert string to float: '125.000/YR(LANDLEASE)' US$125,000/yr (land lease)






    Out[10]:







  
    
      
	Landgrabbed	Landgrabber	Base	Sector	Hectares	Production	Projected investment	Status of deal	Summary
0	Algeria	Al Qudra	UAE	Finance, real estate	31000.0	Milk, olive oil, potatoes	NaN	done	Al Qudra Holding is a joint-stock company esta...
1	Angola	CAMC Engineering Co. Ltd	China	Construction	1500.0	Rice	7.7e+07	done	CAMCE is a subsidiary of the China National Ma...
2	Angola	ENI	Italy	Energy	12000.0	Oil palm	NaN	in process	The project is a joint venture between Sonango...
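
Because the unparsable values are kept as text, the column still mixes floats and strings after this step. One possible follow-up, sketched here on a made-up series rather than on the real column, is to coerce the leftovers to NaN with `pd.to_numeric` while keeping their raw text aside for manual review. The names `parsed`, `numeric` and `needs_review` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical column state after parsing: floats mixed with leftover strings.
parsed = pd.Series([7.7e7, np.nan, 'US$8/ha/yr (lease)', 3.25e7])

# errors='coerce' turns every value that is not a number into NaN.
numeric = pd.to_numeric(parsed, errors='coerce')

# Rows that had a value before but are NaN now still need fixing by hand.
needs_review = parsed[numeric.isna() & parsed.notna()]

print(numeric.dtype)
print(needs_review.tolist())
```

Keeping `needs_review` separate means no information is lost when the column itself becomes purely numeric.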

Fix the production column

There are a couple of production values in different formats.

This code splits each value using a regex, and uses difflib to automatically match every part to the closest known product.

When a product is not known yet, it is added to the known values.



In [11]:

    
import difflib


# Split on separators, the word 'and', newlines and parenthesised remarks.
# The \b anchors keep 'and' from matching inside a longer word.
re_split = re.compile(r'(?:,|&|;|\band\b|\n|\([^)]+\))')
options = []


def fix_production(x: str):
    # already parsed
    if isinstance(x, list):
        return x
    
    # empty, integer, float, etc.
    if not isinstance(x, str):
        return []
    
    # Split the text into products, dropping 'and' and anything in parentheses
    parts = [y.strip() for y in re_split.split(x.lower()) if y.strip()]
    products = []
    for part in parts:
        # Reuse a similar known product if there is one, otherwise register this one as new
        matches = difflib.get_close_matches(part, options, n=1)
        if not matches:
            options.append(part)
            products.append(part)
        else:
            products.append(matches[0])
    return products
    

doc['Production'] = doc['Production'].map(fix_production)
doc[:10]









    Out[11]:







  
    
      
	Landgrabbed	Landgrabber	Base	Sector	Hectares	Production	Projected investment	Status of deal	Summary
0	Algeria	Al Qudra	UAE	Finance, real estate	31000.0	[milk, olive oil, potatoes]	NaN	done	Al Qudra Holding is a joint-stock company esta...
1	Angola	CAMC Engineering Co. Ltd	China	Construction	1500.0	[rice]	7.7e+07	done	CAMCE is a subsidiary of the China National Ma...
2	Angola	ENI	Italy	Energy	12000.0	[oil palm]	NaN	in process	The project is a joint venture between Sonango...
3	Angola	AfriAgro	Portugal	Finance, real estate	5000.0	[oil palm]	3.25e+07	done	AfriAgro is a subsidiary of the Portugal-based...
4	Angola	Eurico Ferreira	Portugal	Energy, telecommunications\n	30000.0	[sugar cane]	2e+08	done	In 2008, Portuguese conglomerate Eurico Ferrei...
5	Angola	Quifel Natural Resources	Portugal	Agribusiness, energy	10000.0	[oilseed]	NaN	done	Quifel Natural Resources is part of Portugal's...
6	Angola	Lonrho	UK	Agribusiness	25000.0	[rice]	NaN	done	In 2005, all that remained of Lonrho, once one...
7	Argentina	Grupo Maggi	Brazil	Agribusiness	7000.0	[soybeans]	NaN	done	Grupo Maggi, controlled by Blairo Maggi, one o...
8	Argentina	Beidahuang	China	Agribusiness	320000.0	[maize, soybeans, wheat]	1.5e+06	suspended	State-owned Beidahuang is the largest farming ...
9	Argentina	Ingleby Company	Denmark	Finance	12433.0	[barley, maize, soybeans, sunflower, wheat]	NaN	done	The Ingleby Company, which is owned by the Rau...
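
The matching step above relies on `difflib.get_close_matches`, which only returns candidates whose similarity ratio reaches a cutoff (0.6 by default). The short sketch below, using made-up word lists, shows both the intended effect and a caveat: short words that merely look alike can be merged too, so a stricter `cutoff` may be worth considering.

```python
import difflib

# Hypothetical list of products seen so far.
known = ['soybeans', 'maize', 'oil palm']

# A near-duplicate spelling is mapped onto the already-known word.
near = difflib.get_close_matches('soybean', known, n=1)
# An unrelated word has no candidate above the default cutoff of 0.6.
unrelated = difflib.get_close_matches('rice', known, n=1)

# Caveat: different products with similar spelling also match by default...
loose = difflib.get_close_matches('wheat', ['meat'], n=1)
# ...unless a stricter cutoff is passed.
strict = difflib.get_close_matches('wheat', ['meat'], n=1, cutoff=0.8)

print(near, unrelated, loose, strict)
```

Note also that the result depends on the order in which values are seen, because the first spelling encountered becomes the known one.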

Pivot Table

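A pivot table is a really good way to summarise values across groups.

This table shows the hectares per land grabber, grouped by the country the land grabber is based in. Because pivot_table aggregates with the mean by default, a land grabber with several deals shows the average hectares per deal.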



In [5]:

    
pd.pivot_table(doc, values=['Hectares'], index=['Base', 'Landgrabber'])









    Out[5]:







  
    
      
      
		Hectares
Base	Landgrabber	
Argentina	Cresud	111333.333333
	El Tejar	190000.000000
	Hillock Capital Management	9000.000000
	Ingacot Group	1000.000000
	Los Grobo	52766.666667
Australia	BKK Partners	100000.000000
	The Trust Company Limited	13691.000000
Bahrain	Hassan Group	10000.000000
Bangladesh	Bangladesh	20200.000000
	Bhati Bangla Agrotec	30000.000000
	Nitol-Niloy Group	10000.000000
Belgium	FELISA	4258.000000
	SIAT	107300.000000
Bermuda	NFD Agro	34300.000000
Brazil	Brazil Agro Business Group	5000.000000
	Grupo Maggi	7000.000000
	JBS	1876.000000
	Monica Semillas	13000.000000
	Petro Buzi	40000.000000
	Pinosso Group	100000.000000
	Vale-Embrapa	30000.000000
Brunei	Brunei Investment Authority	10000.000000
Bulgaria	Ceres	21400.000000
Canada	Alberta Investment Management Company	252000.000000
	Brookfield Asset Management	97124.000000
	Canadian Economic Development Assistance for Southern Sudan (CEDASS)	12200.000000
	Feronia Inc.	110000.000000
	Hancock	47715.000000
	SeedRock Africa Agriculture	40000.000000
Cayman Islands	Nagathom Fund	2200.000000
...	...	...
US	Aslan Global Management	20166.666667
	BDFC Ethiopia	17400.000000
	Black River Asset Management	70000.000000
	Bunge	10000.000000
	Bunge	25000.000000
	CAMS Group	20000.000000
	Dominion Farms	18000.000000
	Elana Agricultural Land Opportunity Fund	29320.000000
	Galtere	25000.000000
	Grain Alliance	40000.000000
	Harvard Management Company	1760.000000
	Herakles Capital\n	38682.000000
	Jarch Management	400000.000000
	Jim Rogers Fund	80000.000000
	Kyiv-Atlantic Ukraine	10000.000000
	Maple Energy	13500.000000
	Millennium Challenge Corporation	22441.000000
	NCH Capital	350000.000000
	Nile Trading and Development Inc.	600000.000000
	Sollus Capital	35000.000000
	Southern Global Inc.	30000.000000
	TM Plantations	50000.000000
	Teachers Insurance and Annuity Association - College Retirement Equities Fund (TIAA-CREF)	248500.000000
	Tiba Agro	320000.000000
	World Bank	29409.000000
US	Black River Asset Management	2100.000000
Vietnam	Long Van 28 Company	200000.000000
	Vietnam Africa Agricultural Development Company\n	10000.000000
	Vietnamese investors	4000.000000
West Africa	UEMOA	11288.000000
    
  

312 rows × 1 columns
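
The fractional values in the table above (for example 111333.333333) come from averaging: `pd.pivot_table` uses the mean unless told otherwise. As a small sketch on made-up deals (investor names and numbers are hypothetical), passing `aggfunc='sum'` gives the total hectares per land grabber instead.

```python
import pandas as pd

# Hypothetical deals, made up for illustration: one investor with two deals.
toy = pd.DataFrame({
    'Base': ['US', 'US', 'UK'],
    'Landgrabber': ['Investor A', 'Investor A', 'Investor B'],
    'Hectares': [10000.0, 30000.0, 5000.0],
})

# Default aggregation: mean hectares per deal.
mean_table = pd.pivot_table(toy, values=['Hectares'], index=['Base', 'Landgrabber'])
# Explicit aggregation: total hectares over all deals.
sum_table = pd.pivot_table(toy, values=['Hectares'], index=['Base', 'Landgrabber'],
                           aggfunc='sum')

print(mean_table)
print(sum_table)
```

Which aggregation is appropriate depends on the question: the mean describes a typical deal, the sum describes the total land involved.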
