Merging Data

We will use this dataset from the EU member states trades for this notebook:

http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=ext_lt_invcur&lang=en


In [1]:
import pandas as pd

In [2]:
df = pd.DataFrame()

The data set is too large to be read in 1 go without risking network interruption etc. Let's read it in chunk.

Data contained in pandas objects can be combined together in a number of built-in ways:

  • pandas.concat glues or stacks together objects along an axis.
  • pandas.merge connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
  • combine_first instance method enables splicing together overlapping data to fill in missing values in one object with values from another.

In [3]:
for chunk in pd.read_csv('data/ext_lt_invcur.tsv', sep='\t', chunksize=100):
    df = pd.concat([df, chunk])

In [4]:
df.head()


Out[4]:
partner,currency,stk_flow,sitc06,geo\time 2014 2012 2010
0 EXT_EU,EUR,EXP,SITC0-4A,AT 61.9 65.6 67
1 EXT_EU,EUR,EXP,SITC0-4A,BE 53.8 85.8 92.4
2 EXT_EU,EUR,EXP,SITC0-4A,BG 57.0 46.2 54.1
3 EXT_EU,EUR,EXP,SITC0-4A,CY 79.1 60.7 61.4
4 EXT_EU,EUR,EXP,SITC0-4A,CZ 58.3 66.7 59.1

The data set read in is messy. Notice that column 2 looks like can be further transformed during the loading phase.

Let's be thoughtful in how we "parse" the dataset.


In [8]:
df = pd.DataFrame()

Let's split out columns with "," to separate various sub-columns


In [9]:
for chunk in pd.read_csv('data/ext_lt_invcur.tsv', sep='\t', chunksize=100):
    data_rows = [row for row in chunk.ix[:,0].str.split(',')]
    data_cols = chunk.columns[0].split(',')
    print(data_rows[:5], data_cols)
    break
    df = pd.concat([df, chunk])


([['EXT_EU', 'EUR', 'EXP', 'SITC0-4A', 'AT'], ['EXT_EU', 'EUR', 'EXP', 'SITC0-4A', 'BE'], ['EXT_EU', 'EUR', 'EXP', 'SITC0-4A', 'BG'], ['EXT_EU', 'EUR', 'EXP', 'SITC0-4A', 'CY'], ['EXT_EU', 'EUR', 'EXP', 'SITC0-4A', 'CZ']], ['partner', 'currency', 'stk_flow', 'sitc06', 'geo\\time'])

There is still issue with the "geo\time"


In [10]:
for chunk in pd.read_csv('data/ext_lt_invcur.tsv', sep='\t', chunksize=100):
    # separate out all the data rows for column index 0
    data_rows = [row for row in chunk.ix[:,0].str.split(',')]
    
    # show me column indexed 0 and split on ","
    data_cols = [col.split('\\')[0] for col in chunk.columns[0].split(',')]
    
    # create new data frame with these new rows and columns
    clean_df = pd.DataFrame(data_rows, columns=data_cols)
    new_df = pd.concat([clean_df, chunk.drop(chunk.columns[0], axis=1)], 
                       axis=1)
    
    # sanity check for top 5 rows
    print new_df.head()
    break
    df = pd.concat([df, chunk])


  partner currency stk_flow    sitc06 geo  2014   2012   2010 
0  EXT_EU      EUR      EXP  SITC0-4A  AT   61.9  65.6     67 
1  EXT_EU      EUR      EXP  SITC0-4A  BE   53.8  85.8   92.4 
2  EXT_EU      EUR      EXP  SITC0-4A  BG   57.0  46.2   54.1 
3  EXT_EU      EUR      EXP  SITC0-4A  CY   79.1  60.7   61.4 
4  EXT_EU      EUR      EXP  SITC0-4A  CZ   58.3  66.7   59.1 

In [ ]: