We will use this Eurostat dataset on trade by EU member states for this notebook:
http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=ext_lt_invcur&lang=en
In [1]:
import pandas as pd
In [2]:
df = pd.DataFrame()
The dataset is too large to read in one go without risking network interruptions, memory pressure, etc. Let's read it in chunks.
Data contained in pandas objects can be combined in a number of built-in ways:
In [3]:
for chunk in pd.read_csv('data/ext_lt_invcur.tsv', sep='\t', chunksize=100):
    # append each 100-row chunk to the growing frame
    df = pd.concat([df, chunk])
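Growing df with pd.concat inside the loop re-copies the accumulated data on every iteration. A common alternative (a minimal sketch, assuming the same file and chunk size as above) is to collect the chunks in a list and concatenate once at the end:

chunks = []
for chunk in pd.read_csv('data/ext_lt_invcur.tsv', sep='\t', chunksize=100):
    chunks.append(chunk)
# one concat at the end instead of one per chunk
df = pd.concat(chunks, ignore_index=True)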
In [4]:
df.head()
Out[4]:
The dataset read in is messy. Notice that the first column packs several comma-separated sub-fields together, so it can be split further during the loading phase.
Let's be thoughtful in how we "parse" the dataset.
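To see why a plain split works, here is a toy example; the value is hypothetical, standing in for whatever comma-separated codes the first column actually holds:

raw = 'indicator,currency,AT'  # hypothetical value shaped like the first column
raw.split(',')                 # ['indicator', 'currency', 'AT']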
In [8]:
df = pd.DataFrame()
Let's split the column on "," to separate out the various sub-columns
In [9]:
for chunk in pd.read_csv('data/ext_lt_invcur.tsv', sep='\t', chunksize=100):
    # split each value in the first column on "," into its sub-fields
    data_rows = [row for row in chunk.iloc[:, 0].str.split(',')]
    # the header of the first column is itself comma-separated
    data_cols = chunk.columns[0].split(',')
    # sanity check on the first chunk only
    print(data_rows[:5], data_cols)
    break
df = pd.concat([df, chunk])
There is still an issue: the last sub-column header comes out as "geo\time" rather than "geo".
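This comes from the Eurostat TSV convention: the first column's header ends in "geo\time", with the row dimension before the backslash. Keeping only the part before the backslash fixes the name; note the backslash has to be escaped in a Python string literal:

'geo\\time'.split('\\')[0]  # 'geo'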
In [10]:
for chunk in pd.read_csv('data/ext_lt_invcur.tsv', sep='\t', chunksize=100):
    # separate out all the data rows for column index 0
    data_rows = [row for row in chunk.iloc[:, 0].str.split(',')]
    # split the header of column index 0 on "," and keep only the part before "\"
    data_cols = [col.split('\\')[0] for col in chunk.columns[0].split(',')]
    # create a new data frame with these new rows and columns
    clean_df = pd.DataFrame(data_rows, columns=data_cols)
    # glue the cleaned columns back onto the rest of the chunk
    new_df = pd.concat([clean_df, chunk.drop(chunk.columns[0], axis=1)],
                       axis=1)
    # sanity check for top 5 rows
    print(new_df.head())
    break
df = pd.concat([df, chunk])
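Once the per-chunk cleaning looks right, the same logic can run over every chunk with the break removed. A minimal sketch, assuming the same file and chunk size as above; clean_df reuses each chunk's index so the two halves line up, and the cleaned chunks are concatenated once at the end:

cleaned_chunks = []
for chunk in pd.read_csv('data/ext_lt_invcur.tsv', sep='\t', chunksize=100):
    data_rows = [row for row in chunk.iloc[:, 0].str.split(',')]
    data_cols = [col.split('\\')[0] for col in chunk.columns[0].split(',')]
    # keep the chunk's original index so the two halves align
    clean_df = pd.DataFrame(data_rows, columns=data_cols, index=chunk.index)
    cleaned_chunks.append(
        pd.concat([clean_df, chunk.drop(chunk.columns[0], axis=1)], axis=1))
df = pd.concat(cleaned_chunks, ignore_index=True)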