Merging Data

We will use this dataset from the EU member states trades for this notebook:

http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=ext_lt_invcur&lang=en



In [1]:

    
import pandas as pd



In [2]:

    
df = pd.DataFrame()

The data set is too large to be read in 1 go without risking network interruption etc. Let's read it in chunk.

Data contained in pandas objects can be combined together in a number of built-in ways:

pandas.concat glues or stacks together objects along an axis.
pandas.merge connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
combine_first instance method enables splicing together overlapping data to fill in missing values in one object with values from another.



In [3]:

    
for chunk in pd.read_csv('data/ext_lt_invcur.tsv', sep='\t', chunksize=100):
    df = pd.concat([df, chunk])



In [4]:

    
df.head()









    Out[4]:






  
    
      
      partner,currency,stk_flow,sitc06,geo\time
      2014
      2012
      2010
    
  
  
    
      0
      EXT_EU,EUR,EXP,SITC0-4A,AT
      61.9
      65.6
      67
    
    
      1
      EXT_EU,EUR,EXP,SITC0-4A,BE
      53.8
      85.8
      92.4
    
    
      2
      EXT_EU,EUR,EXP,SITC0-4A,BG
      57.0
      46.2
      54.1
    
    
      3
      EXT_EU,EUR,EXP,SITC0-4A,CY
      79.1
      60.7
      61.4
    
    
      4
      EXT_EU,EUR,EXP,SITC0-4A,CZ
      58.3
      66.7
      59.1

The data set read in is messy. Notice that column 2 looks like can be further transformed during the loading phase.

Let's be thoughtful in how we "parse" the dataset.



In [8]:

    
df = pd.DataFrame()

Let's split out columns with "," to separate various sub-columns



In [9]:

    
for chunk in pd.read_csv('data/ext_lt_invcur.tsv', sep='\t', chunksize=100):
    data_rows = [row for row in chunk.ix[:,0].str.split(',')]
    data_cols = chunk.columns[0].split(',')
    print(data_rows[:5], data_cols)
    break
    df = pd.concat([df, chunk])









    



([['EXT_EU', 'EUR', 'EXP', 'SITC0-4A', 'AT'], ['EXT_EU', 'EUR', 'EXP', 'SITC0-4A', 'BE'], ['EXT_EU', 'EUR', 'EXP', 'SITC0-4A', 'BG'], ['EXT_EU', 'EUR', 'EXP', 'SITC0-4A', 'CY'], ['EXT_EU', 'EUR', 'EXP', 'SITC0-4A', 'CZ']], ['partner', 'currency', 'stk_flow', 'sitc06', 'geo\\time'])

There is still issue with the "geo\time"



In [10]:

    
for chunk in pd.read_csv('data/ext_lt_invcur.tsv', sep='\t', chunksize=100):
    # separate out all the data rows for column index 0
    data_rows = [row for row in chunk.ix[:,0].str.split(',')]
    
    # show me column indexed 0 and split on ","
    data_cols = [col.split('\\')[0] for col in chunk.columns[0].split(',')]
    
    # create new data frame with these new rows and columns
    clean_df = pd.DataFrame(data_rows, columns=data_cols)
    new_df = pd.concat([clean_df, chunk.drop(chunk.columns[0], axis=1)], 
                       axis=1)
    
    # sanity check for top 5 rows
    print new_df.head()
    break
    df = pd.concat([df, chunk])









    



  partner currency stk_flow    sitc06 geo  2014   2012   2010 
0  EXT_EU      EUR      EXP  SITC0-4A  AT   61.9  65.6     67 
1  EXT_EU      EUR      EXP  SITC0-4A  BE   53.8  85.8   92.4 
2  EXT_EU      EUR      EXP  SITC0-4A  BG   57.0  46.2   54.1 
3  EXT_EU      EUR      EXP  SITC0-4A  CY   79.1  60.7   61.4 
4  EXT_EU      EUR      EXP  SITC0-4A  CZ   58.3  66.7   59.1



In [ ]:

	partner,currency,stk_flow,sitc06,geo\time	2014	2012	2010
0	EXT_EU,EUR,EXP,SITC0-4A,AT	61.9	65.6	67
1	EXT_EU,EUR,EXP,SITC0-4A,BE	53.8	85.8	92.4
2	EXT_EU,EUR,EXP,SITC0-4A,BG	57.0	46.2	54.1
3	EXT_EU,EUR,EXP,SITC0-4A,CY	79.1	60.7	61.4
4	EXT_EU,EUR,EXP,SITC0-4A,CZ	58.3	66.7	59.1