Data Import, Merge, Wrangle

We will be using the real dataset for extra-EU trade percentages for a few different years to illutrate the real-world usage of data import, cleanse, merge and wrangle.


In [1]:
import pandas as pd

In [2]:
df = pd.DataFrame()

In [3]:
df


Out[3]:

Concatenating datasets

Let's incrementally read in the EU trade dataset


In [4]:
for chunk in pd.read_csv('data/ext_lt_invcur.tsv', sep='\t', chunksize=100):
    df = pd.concat([df, chunk])

df.head(3)


Out[4]:
partner,currency,stk_flow,sitc06,geo\time 2014 2012 2010
0 EXT_EU,EUR,EXP,SITC0-4A,AT 61.9 65.6 67
1 EXT_EU,EUR,EXP,SITC0-4A,BE 53.8 85.8 92.4
2 EXT_EU,EUR,EXP,SITC0-4A,BG 57.0 46.2 54.1

Transforming a column

Let's preprocess this dataset to split out the 1st column into more manageable pieces.


In [8]:
df = pd.DataFrame()

# let's figure out how to split out the 1st column's values
for chunk in pd.read_csv('data/ext_lt_invcur.tsv', sep='\t', chunksize=100):
    data_rows = [row for row in chunk.ix[:,0].str.split(',')]
    data_cols = chunk.columns[0].split(',')
    print(data_rows[:2], data_cols)
    break; 
    df = pd.concat([df, chunk])


([['EXT_EU', 'EUR', 'EXP', 'SITC0-4A', 'AT'], ['EXT_EU', 'EUR', 'EXP', 'SITC0-4A', 'BE']], ['partner', 'currency', 'stk_flow', 'sitc06', 'geo\\time'])

Looks like we are getting the right value split. However, the last column name split looks odd.


In [9]:
df = pd.DataFrame()
for chunk in pd.read_csv('data/ext_lt_invcur.tsv', sep='\t', chunksize=100):
    data_rows = [row for row in chunk.ix[:,0].str.split(',')]
    # create a column split list generator
    data_cols = [col.split('\\')[0] for col in chunk.columns[0].split(',')]
    print(data_rows[:2], data_cols)
    break; 
    df = pd.concat([df, chunk])


([['EXT_EU', 'EUR', 'EXP', 'SITC0-4A', 'AT'], ['EXT_EU', 'EUR', 'EXP', 'SITC0-4A', 'BE']], ['partner', 'currency', 'stk_flow', 'sitc06', 'geo'])

The last column name is now called "geo" -- simple and elegant.

Now, we need to merge the list of lists to create the cleaned dataframe:


In [10]:
df = pd.DataFrame()
for chunk in pd.read_csv('data/ext_lt_invcur.tsv', sep='\t', chunksize=100):
    data_rows = [row for row in chunk.ix[:,0].str.split(',')]
    # create a column split list generator
    data_cols = [col.split('\\')[0] for col in chunk.columns[0].split(',')]
    clean_df = pd.DataFrame(data_rows, columns=data_cols)
    
    # now we can concat by "column" which means axis=1
    new_df = pd.concat([clean_df, chunk], axis=1)
    print(new_df)
    break; 
    df = pd.concat([df, chunk])


   partner currency stk_flow    sitc06 geo  \
0   EXT_EU      EUR      EXP  SITC0-4A  AT   
1   EXT_EU      EUR      EXP  SITC0-4A  BE   
2   EXT_EU      EUR      EXP  SITC0-4A  BG   
3   EXT_EU      EUR      EXP  SITC0-4A  CY   
4   EXT_EU      EUR      EXP  SITC0-4A  CZ   
5   EXT_EU      EUR      EXP  SITC0-4A  DE   
6   EXT_EU      EUR      EXP  SITC0-4A  DK   
7   EXT_EU      EUR      EXP  SITC0-4A  EA   
8   EXT_EU      EUR      EXP  SITC0-4A  EE   
9   EXT_EU      EUR      EXP  SITC0-4A  EL   
10  EXT_EU      EUR      EXP  SITC0-4A  ES   
11  EXT_EU      EUR      EXP  SITC0-4A  EU   
12  EXT_EU      EUR      EXP  SITC0-4A  FI   
13  EXT_EU      EUR      EXP  SITC0-4A  FR   
14  EXT_EU      EUR      EXP  SITC0-4A  HR   
15  EXT_EU      EUR      EXP  SITC0-4A  HU   
16  EXT_EU      EUR      EXP  SITC0-4A  IE   
17  EXT_EU      EUR      EXP  SITC0-4A  IT   
18  EXT_EU      EUR      EXP  SITC0-4A  LT   
19  EXT_EU      EUR      EXP  SITC0-4A  LU   
20  EXT_EU      EUR      EXP  SITC0-4A  LV   
21  EXT_EU      EUR      EXP  SITC0-4A  MT   
22  EXT_EU      EUR      EXP  SITC0-4A  NL   
23  EXT_EU      EUR      EXP  SITC0-4A  PL   
24  EXT_EU      EUR      EXP  SITC0-4A  PT   
25  EXT_EU      EUR      EXP  SITC0-4A  RO   
26  EXT_EU      EUR      EXP  SITC0-4A  SE   
27  EXT_EU      EUR      EXP  SITC0-4A  SI   
28  EXT_EU      EUR      EXP  SITC0-4A  SK   
29  EXT_EU      EUR      EXP  SITC0-4A  UK   
..     ...      ...      ...       ...  ..   
70  EXT_EU      EUR      EXP   SITC5-8  ES   
71  EXT_EU      EUR      EXP   SITC5-8  EU   
72  EXT_EU      EUR      EXP   SITC5-8  FI   
73  EXT_EU      EUR      EXP   SITC5-8  FR   
74  EXT_EU      EUR      EXP   SITC5-8  HR   
75  EXT_EU      EUR      EXP   SITC5-8  HU   
76  EXT_EU      EUR      EXP   SITC5-8  IE   
77  EXT_EU      EUR      EXP   SITC5-8  IT   
78  EXT_EU      EUR      EXP   SITC5-8  LT   
79  EXT_EU      EUR      EXP   SITC5-8  LU   
80  EXT_EU      EUR      EXP   SITC5-8  LV   
81  EXT_EU      EUR      EXP   SITC5-8  MT   
82  EXT_EU      EUR      EXP   SITC5-8  NL   
83  EXT_EU      EUR      EXP   SITC5-8  PL   
84  EXT_EU      EUR      EXP   SITC5-8  PT   
85  EXT_EU      EUR      EXP   SITC5-8  RO   
86  EXT_EU      EUR      EXP   SITC5-8  SE   
87  EXT_EU      EUR      EXP   SITC5-8  SI   
88  EXT_EU      EUR      EXP   SITC5-8  SK   
89  EXT_EU      EUR      EXP   SITC5-8  UK   
90  EXT_EU      EUR      EXP     TOTAL  AT   
91  EXT_EU      EUR      EXP     TOTAL  BE   
92  EXT_EU      EUR      EXP     TOTAL  BG   
93  EXT_EU      EUR      EXP     TOTAL  CY   
94  EXT_EU      EUR      EXP     TOTAL  CZ   
95  EXT_EU      EUR      EXP     TOTAL  DE   
96  EXT_EU      EUR      EXP     TOTAL  DK   
97  EXT_EU      EUR      EXP     TOTAL  EA   
98  EXT_EU      EUR      EXP     TOTAL  EE   
99  EXT_EU      EUR      EXP     TOTAL  EL   

   partner,currency,stk_flow,sitc06,geo\time  2014   2012   2010   
0                 EXT_EU,EUR,EXP,SITC0-4A,AT   61.9  65.6     67   
1                 EXT_EU,EUR,EXP,SITC0-4A,BE   53.8  85.8   92.4   
2                 EXT_EU,EUR,EXP,SITC0-4A,BG   57.0  46.2   54.1   
3                 EXT_EU,EUR,EXP,SITC0-4A,CY   79.1  60.7   61.4   
4                 EXT_EU,EUR,EXP,SITC0-4A,CZ   58.3  66.7   59.1   
5                 EXT_EU,EUR,EXP,SITC0-4A,DE   62.5  61.5   65.9   
6                 EXT_EU,EUR,EXP,SITC0-4A,DK   12.8    14   12.2   
7                 EXT_EU,EUR,EXP,SITC0-4A,EA   60.7  65.4   64.1   
8                 EXT_EU,EUR,EXP,SITC0-4A,EE   67.9  62.8   51.8   
9                 EXT_EU,EUR,EXP,SITC0-4A,EL   60.3  58.4     59   
10                EXT_EU,EUR,EXP,SITC0-4A,ES   61.8  63.7   75.6   
11                EXT_EU,EUR,EXP,SITC0-4A,EU   50.1  53.5   53.2   
12                EXT_EU,EUR,EXP,SITC0-4A,FI   42.4  40.7   47.7   
13                EXT_EU,EUR,EXP,SITC0-4A,FR   63.8  62.3   58.4   
14                EXT_EU,EUR,EXP,SITC0-4A,HR   77.3     :      :   
15                EXT_EU,EUR,EXP,SITC0-4A,HU   45.4  45.5   67.8   
16                EXT_EU,EUR,EXP,SITC0-4A,IE   45.1  38.9   36.8   
17                EXT_EU,EUR,EXP,SITC0-4A,IT   73.1  73.1   70.6   
18                EXT_EU,EUR,EXP,SITC0-4A,LT   75.8  79.4   70.9   
19                EXT_EU,EUR,EXP,SITC0-4A,LU   99.0  96.6   94.8   
20                EXT_EU,EUR,EXP,SITC0-4A,LV   56.4  49.7   59.5   
21                EXT_EU,EUR,EXP,SITC0-4A,MT   13.6  34.2    8.7   
22                EXT_EU,EUR,EXP,SITC0-4A,NL   52.4  53.8   53.1   
23                EXT_EU,EUR,EXP,SITC0-4A,PL    0.0     0    0.5   
24                EXT_EU,EUR,EXP,SITC0-4A,PT   72.8  72.8     78   
25                EXT_EU,EUR,EXP,SITC0-4A,RO   45.3  43.4   38.1   
26                EXT_EU,EUR,EXP,SITC0-4A,SE   17.4  14.8   27.4   
27                EXT_EU,EUR,EXP,SITC0-4A,SI   85.7  92.1   91.2   
28                EXT_EU,EUR,EXP,SITC0-4A,SK   93.4  96.5   93.6   
29                EXT_EU,EUR,EXP,SITC0-4A,UK    4.6   3.1    3.2   
..                                       ...    ...    ...    ...  
70                 EXT_EU,EUR,EXP,SITC5-8,ES   71.9  71.4   74.8   
71                 EXT_EU,EUR,EXP,SITC5-8,EU   51.0  51.8   52.8   
72                 EXT_EU,EUR,EXP,SITC5-8,FI   65.8  54.9   49.7   
73                 EXT_EU,EUR,EXP,SITC5-8,FR   52.3    53     53   
74                 EXT_EU,EUR,EXP,SITC5-8,HR   69.8     :      :   
75                 EXT_EU,EUR,EXP,SITC5-8,HU   51.9  45.1   46.6   
76                 EXT_EU,EUR,EXP,SITC5-8,IE   11.4   9.1    9.5   
77                 EXT_EU,EUR,EXP,SITC5-8,IT   75.7  75.9   76.6   
78                 EXT_EU,EUR,EXP,SITC5-8,LT   66.6  58.7   56.5   
79                 EXT_EU,EUR,EXP,SITC5-8,LU   58.3  76.1   69.6   
80                 EXT_EU,EUR,EXP,SITC5-8,LV   70.9  70.9   71.6   
81                 EXT_EU,EUR,EXP,SITC5-8,MT   23.7  27.3   22.3   
82                 EXT_EU,EUR,EXP,SITC5-8,NL   52.6  51.6   55.7   
83                 EXT_EU,EUR,EXP,SITC5-8,PL    0.0     0    0.3   
84                 EXT_EU,EUR,EXP,SITC5-8,PT   70.6  72.5   74.3   
85                 EXT_EU,EUR,EXP,SITC5-8,RO   64.0  64.4   63.5   
86                 EXT_EU,EUR,EXP,SITC5-8,SE   22.0  24.8   22.8   
87                 EXT_EU,EUR,EXP,SITC5-8,SI   78.1  82.3   85.2   
88                 EXT_EU,EUR,EXP,SITC5-8,SK   82.6    88     81   
89                 EXT_EU,EUR,EXP,SITC5-8,UK    4.1   4.1    4.1   
90                   EXT_EU,EUR,EXP,TOTAL,AT   71.7  73.6   74.4   
91                   EXT_EU,EUR,EXP,TOTAL,BE   41.4  50.9   51.6   
92                   EXT_EU,EUR,EXP,TOTAL,BG   41.7  36.7   43.2   
93                   EXT_EU,EUR,EXP,TOTAL,CY   67.4  57.2   63.5   
94                   EXT_EU,EUR,EXP,TOTAL,CZ   49.5  50.5   50.5   
95                   EXT_EU,EUR,EXP,TOTAL,DE   61.8  64.6   66.6   
96                   EXT_EU,EUR,EXP,TOTAL,DK   22.7  22.3   22.8   
97                   EXT_EU,EUR,EXP,TOTAL,EA   58.7  60.3   60.4   
98                   EXT_EU,EUR,EXP,TOTAL,EE   63.0  53.8   44.1   
99                   EXT_EU,EUR,EXP,TOTAL,EL   32.9  36.6   44.8   

[100 rows x 9 columns]

We notice that we still have our own column[0] around -- let's drop it to clean up the DF.


In [12]:
df = pd.DataFrame()
for chunk in pd.read_csv('data/ext_lt_invcur.tsv', sep='\t', chunksize=100):
    data_rows = [row for row in chunk.ix[:,0].str.split(',')]
    # create a column split list generator
    data_cols = [col.split('\\')[0] for col in chunk.columns[0].split(',')]
    clean_df = pd.DataFrame(data_rows, columns=data_cols)
    
    # now we can concat by "column" which means axis=1
    new_df = pd.concat([clean_df, 
                        chunk.drop(chunk.columns[0], axis=1)], axis=1)
    print(new_df)
    break; 
    df = pd.concat([df, chunk])


   partner currency stk_flow    sitc06 geo  2014   2012   2010 
0   EXT_EU      EUR      EXP  SITC0-4A  AT   61.9  65.6     67 
1   EXT_EU      EUR      EXP  SITC0-4A  BE   53.8  85.8   92.4 
2   EXT_EU      EUR      EXP  SITC0-4A  BG   57.0  46.2   54.1 
3   EXT_EU      EUR      EXP  SITC0-4A  CY   79.1  60.7   61.4 
4   EXT_EU      EUR      EXP  SITC0-4A  CZ   58.3  66.7   59.1 
5   EXT_EU      EUR      EXP  SITC0-4A  DE   62.5  61.5   65.9 
6   EXT_EU      EUR      EXP  SITC0-4A  DK   12.8    14   12.2 
7   EXT_EU      EUR      EXP  SITC0-4A  EA   60.7  65.4   64.1 
8   EXT_EU      EUR      EXP  SITC0-4A  EE   67.9  62.8   51.8 
9   EXT_EU      EUR      EXP  SITC0-4A  EL   60.3  58.4     59 
10  EXT_EU      EUR      EXP  SITC0-4A  ES   61.8  63.7   75.6 
11  EXT_EU      EUR      EXP  SITC0-4A  EU   50.1  53.5   53.2 
12  EXT_EU      EUR      EXP  SITC0-4A  FI   42.4  40.7   47.7 
13  EXT_EU      EUR      EXP  SITC0-4A  FR   63.8  62.3   58.4 
14  EXT_EU      EUR      EXP  SITC0-4A  HR   77.3     :      : 
15  EXT_EU      EUR      EXP  SITC0-4A  HU   45.4  45.5   67.8 
16  EXT_EU      EUR      EXP  SITC0-4A  IE   45.1  38.9   36.8 
17  EXT_EU      EUR      EXP  SITC0-4A  IT   73.1  73.1   70.6 
18  EXT_EU      EUR      EXP  SITC0-4A  LT   75.8  79.4   70.9 
19  EXT_EU      EUR      EXP  SITC0-4A  LU   99.0  96.6   94.8 
20  EXT_EU      EUR      EXP  SITC0-4A  LV   56.4  49.7   59.5 
21  EXT_EU      EUR      EXP  SITC0-4A  MT   13.6  34.2    8.7 
22  EXT_EU      EUR      EXP  SITC0-4A  NL   52.4  53.8   53.1 
23  EXT_EU      EUR      EXP  SITC0-4A  PL    0.0     0    0.5 
24  EXT_EU      EUR      EXP  SITC0-4A  PT   72.8  72.8     78 
25  EXT_EU      EUR      EXP  SITC0-4A  RO   45.3  43.4   38.1 
26  EXT_EU      EUR      EXP  SITC0-4A  SE   17.4  14.8   27.4 
27  EXT_EU      EUR      EXP  SITC0-4A  SI   85.7  92.1   91.2 
28  EXT_EU      EUR      EXP  SITC0-4A  SK   93.4  96.5   93.6 
29  EXT_EU      EUR      EXP  SITC0-4A  UK    4.6   3.1    3.2 
..     ...      ...      ...       ...  ..    ...    ...    ...
70  EXT_EU      EUR      EXP   SITC5-8  ES   71.9  71.4   74.8 
71  EXT_EU      EUR      EXP   SITC5-8  EU   51.0  51.8   52.8 
72  EXT_EU      EUR      EXP   SITC5-8  FI   65.8  54.9   49.7 
73  EXT_EU      EUR      EXP   SITC5-8  FR   52.3    53     53 
74  EXT_EU      EUR      EXP   SITC5-8  HR   69.8     :      : 
75  EXT_EU      EUR      EXP   SITC5-8  HU   51.9  45.1   46.6 
76  EXT_EU      EUR      EXP   SITC5-8  IE   11.4   9.1    9.5 
77  EXT_EU      EUR      EXP   SITC5-8  IT   75.7  75.9   76.6 
78  EXT_EU      EUR      EXP   SITC5-8  LT   66.6  58.7   56.5 
79  EXT_EU      EUR      EXP   SITC5-8  LU   58.3  76.1   69.6 
80  EXT_EU      EUR      EXP   SITC5-8  LV   70.9  70.9   71.6 
81  EXT_EU      EUR      EXP   SITC5-8  MT   23.7  27.3   22.3 
82  EXT_EU      EUR      EXP   SITC5-8  NL   52.6  51.6   55.7 
83  EXT_EU      EUR      EXP   SITC5-8  PL    0.0     0    0.3 
84  EXT_EU      EUR      EXP   SITC5-8  PT   70.6  72.5   74.3 
85  EXT_EU      EUR      EXP   SITC5-8  RO   64.0  64.4   63.5 
86  EXT_EU      EUR      EXP   SITC5-8  SE   22.0  24.8   22.8 
87  EXT_EU      EUR      EXP   SITC5-8  SI   78.1  82.3   85.2 
88  EXT_EU      EUR      EXP   SITC5-8  SK   82.6    88     81 
89  EXT_EU      EUR      EXP   SITC5-8  UK    4.1   4.1    4.1 
90  EXT_EU      EUR      EXP     TOTAL  AT   71.7  73.6   74.4 
91  EXT_EU      EUR      EXP     TOTAL  BE   41.4  50.9   51.6 
92  EXT_EU      EUR      EXP     TOTAL  BG   41.7  36.7   43.2 
93  EXT_EU      EUR      EXP     TOTAL  CY   67.4  57.2   63.5 
94  EXT_EU      EUR      EXP     TOTAL  CZ   49.5  50.5   50.5 
95  EXT_EU      EUR      EXP     TOTAL  DE   61.8  64.6   66.6 
96  EXT_EU      EUR      EXP     TOTAL  DK   22.7  22.3   22.8 
97  EXT_EU      EUR      EXP     TOTAL  EA   58.7  60.3   60.4 
98  EXT_EU      EUR      EXP     TOTAL  EE   63.0  53.8   44.1 
99  EXT_EU      EUR      EXP     TOTAL  EL   32.9  36.6   44.8 

[100 rows x 8 columns]

This looks a lot cleaner!

We are finally ready to run this for all input data:


In [30]:
df = pd.DataFrame()
for chunk in pd.read_csv('data/ext_lt_invcur.tsv', sep='\t', chunksize=100):
    data_rows = [row for row in chunk.ix[:,0].str.split(',')]
    # create a column split list generator
    data_cols = [col.split('\\')[0] for col in chunk.columns[0].split(',')]
    clean_df = pd.DataFrame(data_rows, columns=data_cols)
    
    # now we can concat by "column" which means axis=1
    new_df = pd.concat([clean_df, 
                        chunk.drop(chunk.columns[0], axis=1)], axis=1)
    df = pd.concat([df, new_df])
    
df.head(3)


Out[30]:
partner currency stk_flow sitc06 geo 2014 2012 2010
0 EXT_EU EUR EXP SITC0-4A AT 61.9 65.6 67
1 EXT_EU EUR EXP SITC0-4A BE 53.8 85.8 92.4
2 EXT_EU EUR EXP SITC0-4A BG 57.0 46.2 54.1

We have concat data both horizontally and vertically, and able to clean up the column name as well as transforming a messy column by splitting down its components.


In [35]:
df.shape


Out[35]:
(1320, 8)

In [33]:
df.describe(include='all')


Out[33]:
partner currency stk_flow sitc06 geo 2014 2012 2010
count 1320 1320 1320 1320 1320 1320.000000 1320 1320
unique 2 5 2 4 33 NaN 518 471
top EXT_EU OTH IMP SITC5-8 UK NaN 100 100
freq 1200 264 660 330 40 NaN 238 248
mean NaN NaN NaN NaN NaN 39.998712 NaN NaN
std NaN NaN NaN NaN NaN 39.025858 NaN NaN
min NaN NaN NaN NaN NaN 0.000000 NaN NaN
25% NaN NaN NaN NaN NaN 2.275000 NaN NaN
50% NaN NaN NaN NaN NaN 28.650000 NaN NaN
75% NaN NaN NaN NaN NaN 75.800000 NaN NaN
max NaN NaN NaN NaN NaN 100.000000 NaN NaN

In [38]:
df.loc[df['sitc06']=='TOTAL']


Out[38]:
90    False
91     True
92    False
93    False
94    False
95    False
96    False
97    False
98    False
99    False
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
      ...  
9     False
10    False
11    False
21    False
22    False
23    False
33    False
34    False
35    False
45    False
46    False
47    False
57    False
58    False
59    False
69    False
70    False
71    False
81    False
82    False
83    False
93    False
94    False
95    False
5     False
6     False
7     False
17    False
18    False
19    False
Name: geo, dtype: bool

In [ ]: