Exploring data structures for a dataset

A Pandas exploration. Find the best data structure to explore and transform the dataset (both the training and test dataframes). Use cases:

  • find all numerical features (filtering)
  • transform all numerical features (e.g. take square)
  • replace NaN values for a numerical feature
  • plot distribution for a column in the training dataset

In [ ]:
import sys
import os

import pandas as pd
import seaborn as sns

In [ ]:
sys.path.insert(1, os.path.join(sys.path[0], '..'))  # add parent directory to path
import samlib

Using samlib.DataSet

Original approach used in data_exploration_numerical_features

  • a class that contains 3 dataframe attributes (train, test, and df, where df is the full dataframe)
  • whenever df is updated, the train and test frames are updated (a sketch of such a class follows this list)

This allows us to work with the training dataset, and to update/transform the full dataset when necessary, so that the transformation is also applied to the test dataframe that will be needed for the final prediction.
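
Here is a hypothetical sketch of what a class like samlib.DataSet might look like; the real samlib implementation may differ.

In [ ]:
# Hypothetical sketch of a DataSet-like class (not the real samlib code).
# The full dataframe df plus train/test frames that are re-synced through
# apply, matching the behaviour described above.
class DataSet:
    def __init__(self, train, test):
        self._ntrain = len(train)
        self.df = pd.concat([train, test], ignore_index=True)  # full dataframe
        self._sync()

    def _sync(self):
        # rebuild the train/test frames from the full dataframe
        self.train = self.df.iloc[:self._ntrain]
        self.test = self.df.iloc[self._ntrain:]

    @property
    def dtypes(self):
        return self.df.dtypes

    def apply(self, func, inplace=False):
        # transform the full dataframe, then re-sync train and test
        if inplace:
            self.df = func(self.df)
            self._sync()
            return None
        return func(self.df)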

In [ ]:
raw_train = pd.read_csv('../data/train_prepared_light.csv')
raw_test = pd.read_csv('../data/test_prepared_light.csv')
ds = samlib.DataSet(raw_train, raw_test)

In [ ]:
is_num = ds.dtypes != object
dfnum = ds.df.loc[:, is_num]
dfnum.head()

In [ ]:
ds.apply(lambda df: df.loc[:, is_num]**2, inplace=True)  # square the numerical features (note: this keeps only the numerical columns)
ds.df.head()

In [ ]:
ds.df.MasVnrArea.isnull().sum()

In [ ]:
ds.df.loc[ds.df.MasVnrArea.isnull(), 'MasVnrArea'] = 0

In [ ]:
ds.df.MasVnrArea.isnull().sum()

In [ ]:
sns.distplot(ds.train.GrLivArea)

This works, but it's not so great: it requires a new dependency (samlib) and a different way of working compared to plain Pandas. We need to learn the behaviour of the DataSet class, and remember to use the apply method, otherwise the train and test sets are not kept in sync (for example when assigning to a slice of ds.df).
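
To illustrate that pitfall (assuming, per the behaviour described above, that train and test are only re-synced when going through apply):

In [ ]:
# Direct assignment mutates ds.df only; ds.train / ds.test may go stale
ds.df.loc[ds.df.MasVnrArea.isnull(), 'MasVnrArea'] = 0
# Going through apply keeps all three frames in sync instead
ds.apply(lambda df: df.fillna({'MasVnrArea': 0}), inplace=True)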

Using an extra categorical dataset column


In [ ]:
traindf = raw_train.copy()
testdf = raw_test.copy()
traindf['dataset'] = 'train'
testdf['dataset'] = 'test'
df = pd.concat([traindf, testdf])

Then we can filter using the value of the dataset column:


In [ ]:
train = df['dataset'] == 'train'
test = ~train

In [ ]:
df[train].head()

In [ ]:
df[test].head()

In [ ]:
is_num = df.dtypes != object
dfnum = df.loc[:, is_num]
dfnum.head()

In [ ]:
df.loc[:, is_num] = dfnum ** 2
df.head()

In [ ]:
df.MasVnrArea.isnull().sum()

In [ ]:
df.loc[df.MasVnrArea.isnull(), 'MasVnrArea'] = 0

In [ ]:
df.MasVnrArea.isnull().sum()

In [ ]:
sns.distplot(df.loc[train, 'GrLivArea'])

This works quite well, but takes a bit of work to set up and requires keeping two boolean series (train and test) around to filter the dataset whenever needed. An improvement over samlib.DataSet though.
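
The setup can be made a bit more compact with DataFrame.assign:

In [ ]:
# more compact setup: tag each frame with its dataset name in one expression
df = pd.concat([raw_train.assign(dataset='train'),
                raw_test.assign(dataset='test')])
train = df['dataset'] == 'train'
test = ~train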

Using a Panel object


In [ ]:
panel = pd.Panel({'train':raw_train.copy(), 'test': raw_test.copy()})

In [ ]:
panel.train.head()

In [ ]:
panel.test.head()

The above is very nice, but unfortunately a panel isn't a dataframe, so we can't really get a view of the full data. We also seem to have lost all the data types (presumably upcast to a common object dtype when the panel was built):


In [ ]:
is_num = panel.train.dtypes != object
any(is_num)

So we must keep the raw data if we want to filter the numerical columns :-(


In [ ]:
is_num = raw_train.dtypes != object
numpanel = panel.loc[:, :, is_num]

In [ ]:
numpanel

In [ ]:
numpanel.train.head()

Finally, assigning through an indexer raises an error!


In [ ]:
try:
    panel.loc[:, :, is_num] = panel.loc[:, :, is_num]**2
except NotImplementedError as err:
    print('raises NotImplementedError: ', err)

Looked promising initially, but it's not really workable since we can't yet assign to an indexer on a Panel. We really need a dataframe object. (Indeed, pd.Panel has since been deprecated and removed from pandas.)

Using multi-index on rows


In [ ]:
traindf = raw_train.copy()
testdf = raw_test.copy()
df = pd.concat([traindf, testdf], keys=('train', 'test'))
df.head()

In [ ]:
df.tail()

The test and train datasets can be accessed by filtering on the index. Nice, though not quite as compact as df[train]; on the other hand we don't need the extra train (and test) masks.


In [ ]:
df.loc['train'].head()

In [ ]:
is_num = df.dtypes != object
dfnum = df.loc[:, is_num]
dfnum.head()

In [ ]:
df.loc[:, is_num] = dfnum ** 2
df.head()

In [ ]:
df.MasVnrArea.isnull().sum()

In [ ]:
df.loc[df.MasVnrArea.isnull(), 'MasVnrArea'] = 0

In [ ]:
df.MasVnrArea.isnull().sum()

In [ ]:
sns.distplot(df.GrLivArea.train)

Another way of doing it:


In [ ]:
sns.distplot(df.loc['train', 'GrLivArea'])

Works very well.
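
Splitting the frames back apart, e.g. to fit a model on the training data only, is also straightforward:

In [ ]:
# recover the individual frames from the row multi-index when needed
train_df = df.loc['train']
test_df = df.loc['test']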

Using multi-index on columns (swapped levels)

Swap the levels to fix the issue with filtering on features that we'll run into in the plain column multi-index case (next section).


In [ ]:
traindf = raw_train.copy()
testdf = raw_test.copy()
df = pd.concat([traindf, testdf], axis=1, keys=('train','test')).swaplevel(axis=1)
df.sort_index(axis=1, inplace=True)  # needed, otherwise slicing the unsorted column multi-index raises errors

In [ ]:
df.head()

In [ ]:
df.tail()

The test and train datasets can be accessed with a cross-section over the columns. Not quite as compact as df[train], but we don't need the extra train (and test) masks.


In [ ]:
df.xs('train', level=1, axis=1).head()  # or use IndexSlice

We must also deal with the extra index level when filtering, but it's not too bad.


In [ ]:
is_num = df.dtypes != object
dfnum = df.loc[:, is_num]
dfnum.head()

In [ ]:
df.loc[:, is_num] = dfnum ** 2
df.head()

Getting and setting nulls (without fillna) is a little tricky: boolean indexing is, by definition, meant to work over rows, not over rows *and* columns. We can use boolean arrays with DataFrame.mask though. This is definitely something to keep in mind when using multi-indexing over columns.


In [ ]:
df.MasVnrArea = df.MasVnrArea.mask(df.MasVnrArea.isnull(), 0)

In [ ]:
df.MasVnrArea.tail()
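
For completeness, selecting the feature at the first column level and using fillna should achieve the same result, via the same assignment mechanism as the mask version above:

In [ ]:
# equivalent NaN replacement using fillna on the level-0 selection
df['MasVnrArea'] = df['MasVnrArea'].fillna(0)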

Visualizing the training dataset is pretty easy.


In [ ]:
sns.distplot(df.GrLivArea.train)

Using multi-index on columns

Makes it easier to filter on dataset (train or test) and has the advantage of being a dataframe.


In [ ]:
traindf = raw_train.copy()
testdf = raw_test.copy()
df = pd.concat([traindf, testdf], axis=1, keys=('train','test'))
df.head()

In [ ]:
df.tail()

The test and train datasets can be accessed by selecting the first column level (df.train). Nice and compact, and we don't need the extra train (and test) masks.


In [ ]:
df.train.head()

We must also deal with the extra index level when filtering, but it's not too bad.


In [ ]:
is_num = df.dtypes != object
dfnum = df.loc[:, is_num]
dfnum.head()

In [ ]:
df.loc[:, is_num] = dfnum ** 2
df.head()

Definitely harder to slice across columns. It's possible (unlike with panels), but hard (it requires pd.IndexSlice).


In [ ]:
df.loc[:, pd.IndexSlice[:, 'MasVnrArea']].isnull().sum()

You can also use a cross-section to get the data more easily, but you can't use this for assignments.


In [ ]:
df.xs('MasVnrArea', axis=1, level=1).head()

In [ ]:
df.loc[:, pd.IndexSlice[:, 'MasVnrArea']] = df.loc[:, pd.IndexSlice[:, 'MasVnrArea']].fillna(0)  # fill only the NaNs rather than zeroing the whole column

In [ ]:
df.loc[:, pd.IndexSlice[:, 'MasVnrArea']].isnull().sum()

Visualizing the training dataset is pretty easy.


In [ ]:
sns.distplot(df.train.GrLivArea)

Using dataset type as label

Method 1: add a column, then use set_index


In [ ]:
traindf = raw_train.copy()
testdf = raw_test.copy()
traindf['Dataset'] = 'train'
testdf['Dataset'] = 'test'
df = pd.concat([traindf, testdf])
df.set_index('Dataset').head()  # note: set_index returns a new frame; the result is only previewed here

Method 2: use concat and droplevel


In [ ]:
traindf = raw_train.copy()
testdf = raw_test.copy()
df = pd.concat([traindf, testdf], keys=('train', 'test'))
df.index = df.index.droplevel(1)
df.head()

In [ ]:
df.tail()

The test and train datasets can be accessed using loc.


In [ ]:
df.loc['train'].head()

Filtering columns is very easy:


In [ ]:
is_num = df.dtypes != object
dfnum = df.loc[:, is_num]
dfnum.head()

In [ ]:
df.loc[:, is_num] = dfnum ** 2
df.head()

In [ ]:
df.MasVnrArea.isnull().sum()

In [ ]:
df.loc[df.MasVnrArea.isnull(), 'MasVnrArea'] = 0

In [ ]:
df.MasVnrArea.isnull().sum()

In [ ]:
sns.distplot(df.GrLivArea.train)

Another way of doing it:


In [ ]:
sns.distplot(df.loc['train', 'GrLivArea'])

Discussion

Samlib

  • Pros: does most of what we need pretty easily
  • Cons: third-party dependency, hackish, introduces a new structure with surprising behaviour (assigning to a slice doesn't update the training and test datasets)
  • Score: 2/5

Extra categorical dataset column

  • Pros: works very well and the syntax is compact
  • Cons: a bit long to set up, requires maintaining the test and train mask variables alongside the data
  • Score: 4/5

Panel

  • Doesn't work.

Multi-index on rows

  • Pros: excellent, easy to filter on columns and on dataset
  • Cons: none
  • Score: 5/5

Multi-index on columns

  • Pros: easy to filter on the train/test sets
  • Cons: hard to transform features for both datasets; would be awkward if the train and test sets had widely different numbers of rows
  • Score: 1/5

Dataset label

  • Pros: index is not a multi-index
  • Cons: a bit hard to set up, and the index looks a bit odd since all samples share the same label
  • Score: 4/5

Conclusion

Use pd.concat([traindf, testdf], keys=['train', 'test']) to merge the datasets into one dataframe, making it easy to visualize and process features on the training dataset only.
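
As a recap, the recommended pattern in full:

In [ ]:
# recommended pattern: one concatenated frame with a row multi-index
df = pd.concat([raw_train, raw_test], keys=('train', 'test'))

# transformations apply to both datasets at once...
is_num = df.dtypes != object
df.loc[:, is_num] = df.loc[:, is_num] ** 2

# ...while plotting/inspection can target the training dataset only
sns.distplot(df.loc['train', 'GrLivArea'])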