Exploring data structures for a dataset

A Pandas exploration. Find the best data structure to explore and transform the dataset (both the training and test dataframes). Use cases:

  • find all numerical features (filtering)
  • transform all numerical features (e.g. take square)
  • replace NaN values for a numerical feature
  • plot distribution for a column in the training dataset

In [ ]:
import sys
import os

import pandas as pd
import seaborn as sns

In [ ]:
sys.path.insert(1, os.path.join(sys.path[0], '..'))  # add parent directory to path
import samlib

Using samlib.DataSet

Original approach used in data_exploration_numerical_features

  • a class that contains 3 dataframe attributes (train, test, and df, where df is the full dataframe)
  • whenever df is updated, the train and test frames are updated (a sketch of such a class follows this list)

This allows us to work with the training dataset, and to update/transform the full dataset when necessary, so that the transformation is also applied to the test dataframe that will be needed for the final prediction.
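
Here is a hypothetical sketch of what a class like samlib.DataSet might look like; the real samlib implementation may differ.

In [ ]:
# Hypothetical sketch of a DataSet-like class (not the real samlib code).
# The full dataframe df plus train/test frames that are re-synced through
# apply, matching the behaviour described above.
class DataSet:
    def __init__(self, train, test):
        self._ntrain = len(train)
        self.df = pd.concat([train, test], ignore_index=True)  # full dataframe
        self._sync()

    def _sync(self):
        # rebuild the train/test frames from the full dataframe
        self.train = self.df.iloc[:self._ntrain]
        self.test = self.df.iloc[self._ntrain:]

    @property
    def dtypes(self):
        return self.df.dtypes

    def apply(self, func, inplace=False):
        # transform the full dataframe, then re-sync train and test
        if inplace:
            self.df = func(self.df)
            self._sync()
            return None
        return func(self.df)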

In [ ]:
raw_train = pd.read_csv('../data/train_prepared_light.csv')
raw_test = pd.read_csv('../data/test_prepared_light.csv')
ds = samlib.DataSet(raw_train, raw_test)

In [ ]:
is_num = ds.dtypes != object
dfnum = ds.df.loc[:, is_num]
dfnum.head()

In [ ]:
ds.apply(lambda df: df.loc[:, is_num]**2, inplace=True)  # square the numerical features (note: this keeps only the numerical columns)
ds.df.head()

In [ ]:
ds.df.MasVnrArea.isnull().sum()

In [ ]:
ds.df.loc[ds.df.MasVnrArea.isnull(), 'MasVnrArea'] = 0

In [ ]:
ds.df.MasVnrArea.isnull().sum()

In [ ]:
sns.distplot(ds.train.GrLivArea)

This works, but it's not so great: it requires a new dependency (samlib) and a different way of working compared to plain Pandas. We need to learn the behaviour of the DataSet class, and remember to use the apply method, otherwise the train and test sets are not kept in sync (for example when assigning to a slice of ds.df).
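
To illustrate that pitfall (assuming, per the behaviour described above, that train and test are only re-synced when going through apply):

In [ ]:
# Direct assignment mutates ds.df only; ds.train / ds.test may go stale
ds.df.loc[ds.df.MasVnrArea.isnull(), 'MasVnrArea'] = 0
# Going through apply keeps all three frames in sync instead
ds.apply(lambda df: df.fillna({'MasVnrArea': 0}), inplace=True)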

Using an extra categorical dataset column


In [ ]:
traindf = raw_train.copy()
testdf = raw_test.copy()
traindf['dataset'] = 'train'
testdf['dataset'] = 'test'
df = pd.concat([traindf, testdf])

Then we can filter using the value of the dataset column:


In [ ]:
train = df['dataset'] == 'train'
test = ~train

In [ ]:
df[train].head()

In [ ]:
df[test].head()

In [ ]:
is_num = df.dtypes != object
dfnum = df.loc[:, is_num]
dfnum.head()

In [ ]:
df.loc[:, is_num] = dfnum ** 2
df.head()

In [ ]:
df.MasVnrArea.isnull().sum()

In [ ]:
df.loc[df.MasVnrArea.isnull(), 'MasVnrArea'] = 0

In [ ]:
df.MasVnrArea.isnull().sum()

In [ ]:
sns.distplot(df.loc[train, 'GrLivArea'])

This works quite well, but takes a bit of work to set up and requires keeping two boolean series (train and test) around to filter the dataset whenever needed. An improvement over samlib.DataSet though.
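
The setup can be made a bit more compact with DataFrame.assign:

In [ ]:
# more compact setup: tag each frame with its dataset name in one expression
df = pd.concat([raw_train.assign(dataset='train'),
                raw_test.assign(dataset='test')])
train = df['dataset'] == 'train'
test = ~train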

Using a Panel object


In [ ]:
panel = pd.Panel({'train':raw_train.copy(), 'test': raw_test.copy()})

In [ ]:
panel.train.head()

In [ ]:
panel.test.head()

The above is very nice, but unfortunately a panel isn't a dataframe, so we can't really get a view of the full data. We also seem to have lost all the data types (presumably upcast to a common object dtype when the panel was built):


In [ ]:
is_num = panel.train.dtypes != object
any(is_num)

So we must keep the raw data if we want to filter the numerical columns :-(


In [ ]:
is_num = raw_train.dtypes != object
numpanel = panel.loc[:, :, is_num]

In [ ]:
numpanel

In [ ]:
numpanel.train.head()

Finally, assigning through an indexer raises an error!


In [ ]:
try:
    panel.loc[:, :, is_num] = panel.loc[:, :, is_num]**2
except NotImplementedError as err:
    print('raises NotImplementedError: ', err)

Looked promising initially, but it's not really workable since we can't yet assign to an indexer on a Panel. We really need a dataframe object. (Indeed, pd.Panel has since been deprecated and removed from pandas.)

Using multi-index on rows


In [ ]:
traindf = raw_train.copy()
testdf = raw_test.copy()
df = pd.concat([traindf, testdf], keys=('train', 'test'))
df.head()

In [ ]:
df.tail()

The test and train datasets can be accessed by filtering on the index. Nice, though not quite as compact as df[train]; on the other hand we don't need the extra train (and test) masks.


In [ ]:
df.loc['train'].head()

In [ ]:
is_num = df.dtypes != object
dfnum = df.loc[:, is_num]
dfnum.head()

In [ ]:
df.loc[:, is_num] = dfnum ** 2
df.head()

In [ ]:
df.MasVnrArea.isnull().sum()

In [ ]:
df.loc[df.MasVnrArea.isnull(), 'MasVnrArea'] = 0

In [ ]:
df.MasVnrArea.isnull().sum()

In [ ]:
sns.distplot(df.GrLivArea.train)

Another way of doing it:


In [ ]:
sns.distplot(df.loc['train', 'GrLivArea'])

Works very well.
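
Splitting the frames back apart, e.g. to fit a model on the training data only, is also straightforward:

In [ ]:
# recover the individual frames from the row multi-index when needed
train_df = df.loc['train']
test_df = df.loc['test']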

Using multi-index on columns (swapped levels)

Swap the levels to fix the issue with filtering on features that we'll run into in the plain column multi-index case (next section).


In [ ]:
traindf = raw_train.copy()
testdf = raw_test.copy()
df = pd.concat([traindf, testdf], axis=1, keys=('train','test')).swaplevel(axis=1)
df.sort_index(axis=1, inplace=True)  # needed, otherwise slicing the unsorted column multi-index raises errors

In [ ]:
df.head()

In [ ]:
df.tail()

The test and train datasets can be accessed with a cross-section over the columns. Not quite as compact as df[train], but we don't need the extra train (and test) masks.


In [ ]:
df.xs('train', level=1, axis=1).head()  # or use IndexSlice

We must also deal with the extra index level when filtering, but it's not too bad.


In [ ]:
is_num = df.dtypes != object
dfnum = df.loc[:, is_num]
dfnum.head()

In [ ]:
df.loc[:, is_num] = dfnum ** 2
df.head()

Getting and setting nulls (without fillna) is a little tricky: boolean indexing is, by definition, meant to work over rows, not over rows *and* columns. We can use boolean arrays with DataFrame.mask though. This is definitely something to keep in mind when using multi-indexing over columns.


In [ ]:
df.MasVnrArea = df.MasVnrArea.mask(df.MasVnrArea.isnull(), 0)

In [ ]:
df.MasVnrArea.tail()
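
For completeness, selecting the feature at the first column level and using fillna should achieve the same result, via the same assignment mechanism as the mask version above:

In [ ]:
# equivalent NaN replacement using fillna on the level-0 selection
df['MasVnrArea'] = df['MasVnrArea'].fillna(0)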

Visualizing the training dataset is pretty easy.


In [ ]:
sns.distplot(df.GrLivArea.train)

Using multi-index on columns

Makes it easier to filter on dataset (train or test) and has the advantage of being a dataframe.


In [ ]:
traindf = raw_train.copy()
testdf = raw_test.copy()
df = pd.concat([traindf, testdf], axis=1, keys=('train','test'))
df.head()

In [ ]:
df.tail()

The test and train datasets can be accessed by selecting the first column level (df.train). Nice and compact, and we don't need the extra train (and test) masks.


In [ ]:
df.train.head()

We must also deal with the extra index level when filtering, but it's not too bad.


In [ ]:
is_num = df.dtypes != object
dfnum = df.loc[:, is_num]
dfnum.head()

In [ ]:
df.loc[:, is_num] = dfnum ** 2
df.head()

Definitely harder to slice across columns. It's possible (unlike with panels), but hard (it requires pd.IndexSlice).


In [ ]:
df.loc[:, pd.IndexSlice[:, 'MasVnrArea']].isnull().sum()

You can also use a cross-section to get the data more easily, but you can't use this for assignments.


In [ ]:
df.xs('MasVnrArea', axis=1, level=1).head()

In [ ]:
df.loc[:, pd.IndexSlice[:, 'MasVnrArea']] = df.loc[:, pd.IndexSlice[:, 'MasVnrArea']].fillna(0)  # fill only the NaNs rather than zeroing the whole column

In [ ]:
df.loc[:, pd.IndexSlice[:, 'MasVnrArea']].isnull().sum()

Visualizing the training dataset is pretty easy.


In [ ]:
sns.distplot(df.train.GrLivArea)

Using dataset type as label

Method 1: add a column, then use set_index


In [ ]:
traindf = raw_train.copy()
testdf = raw_test.copy()
traindf['Dataset'] = 'train'
testdf['Dataset'] = 'test'
df = pd.concat([traindf, testdf])
df.set_index('Dataset').head()  # note: set_index returns a new frame; the result is only previewed here

Method 2: use concat and droplevel


In [ ]:
traindf = raw_train.copy()
testdf = raw_test.copy()
df = pd.concat([traindf, testdf], keys=('train', 'test'))
df.index = df.index.droplevel(1)
df.head()

In [ ]:
df.tail()

The test and train datasets can be accessed using loc.


In [ ]:
df.loc['train'].head()

Filtering columns is very easy:


In [ ]:
is_num = df.dtypes != object
dfnum = df.loc[:, is_num]
dfnum.head()

In [ ]:
df.loc[:, is_num] = dfnum ** 2
df.head()

In [ ]:
df.MasVnrArea.isnull().sum()

In [ ]:
df.loc[df.MasVnrArea.isnull(), 'MasVnrArea'] = 0

In [ ]:
df.MasVnrArea.isnull().sum()

In [ ]:
sns.distplot(df.GrLivArea.train)

Another way of doing it:


In [ ]:
sns.distplot(df.loc['train', 'GrLivArea'])

Discussion

Samlib

  • Pros: does most of what we need pretty easily
  • Cons: third-party dependency, hackish, introduces a new structure with surprising behaviour (assigning to a slice doesn't update the training and test datasets)
  • Score: 2/5

Extra categorical dataset column

  • Pros: works very well and the syntax is compact
  • Cons: a bit long to set up, requires maintaining the test and train mask variables alongside the data
  • Score: 4/5

Panel

  • Doesn't work.

Multi-index on rows

  • Pros: excellent, easy to filter on columns and on dataset
  • Cons: none
  • Score: 5/5

Multi-index on columns

  • Pros: easy to filter on the train/test sets
  • Cons: hard to transform features for both datasets; would be awkward if the train and test sets had widely different numbers of rows
  • Score: 1/5

Dataset label

  • Pros: index is not a multi-index
  • Cons: a bit hard to set up, and the index looks a bit odd since all samples share the same label
  • Score: 4/5

Conclusion

Use pd.concat([traindf, testdf], keys=['train', 'test']) to merge the datasets into one dataframe, making it easy to visualize and process features on the training dataset only.
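
As a recap, the recommended pattern in full:

In [ ]:
# recommended pattern: one concatenated frame with a row multi-index
df = pd.concat([raw_train, raw_test], keys=('train', 'test'))

# transformations apply to both datasets at once...
is_num = df.dtypes != object
df.loc[:, is_num] = df.loc[:, is_num] ** 2

# ...while plotting/inspection can target the training dataset only
sns.distplot(df.loc['train', 'GrLivArea'])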