A pandas exploration: find the best data structure for exploring and transforming the dataset (both the training and test DataFrames). Use case:
In [ ]:
import sys
import os
import pandas as pd
import seaborn as sns
In [ ]:
sys.path.insert(1, os.path.join(sys.path[0], '..')) # add parent directory to path
import samlib
samlib.DataSet
Original approach, used in data_exploration_numerical_features
In [ ]:
raw_train = pd.read_csv('../data/train_prepared_light.csv')
raw_test = pd.read_csv('../data/test_prepared_light.csv')
ds = samlib.DataSet(raw_train, raw_test)
In [ ]:
is_num = ds.dtypes != object
dfnum = ds.df.loc[:, is_num]
dfnum.head()
In [ ]:
ds.apply(lambda df: df.loc[:, is_num]**2, inplace=True)
ds.df.head()
In [ ]:
ds.df.MasVnrArea.isnull().sum()
In [ ]:
ds.df.loc[ds.df.MasVnrArea.isnull(), 'MasVnrArea'] = 0
In [ ]:
ds.df.MasVnrArea.isnull().sum()
In [ ]:
sns.distplot(ds.train.GrLivArea)
This works, but it is not great: it requires a new dependency (samlib) and a different way of working compared to plain pandas. You need to learn the behaviour of the DataSet class, and remember to use its apply method, otherwise the train and test sets fall out of sync (for example when assigning to a slice of ds.df).
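For reference, a minimal sketch of what such a wrapper might look like. This is a hypothetical reconstruction for illustration, not samlib's actual implementation:

```python
import pandas as pd

# Hypothetical DataSet-style wrapper: keep train and test in one
# concatenated frame, and expose views of each part.
class DataSet:
    def __init__(self, train, test):
        self._n_train = len(train)
        self.df = pd.concat([train, test], ignore_index=True)

    @property
    def train(self):
        # First n_train rows are the training set.
        return self.df.iloc[:self._n_train]

    @property
    def test(self):
        return self.df.iloc[self._n_train:]

    def apply(self, func, inplace=False):
        # Transform the combined frame so both sets stay in sync.
        result = func(self.df)
        if inplace:
            self.df = result
            return None
        return result

# Toy frames standing in for raw_train / raw_test.
raw_train = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
raw_test = pd.DataFrame({'A': [3], 'B': ['z']})
ds = DataSet(raw_train, raw_test)
ds.apply(lambda df: df.assign(A=df.A ** 2), inplace=True)
```

The pain point the notebook describes is visible here: any mutation must go through apply (or through ds.df directly), or the wrapper's view of the data diverges.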
In [ ]:
traindf = raw_train.copy()
testdf = raw_test.copy()
traindf['dataset'] = 'train'
testdf['dataset'] = 'test'
df = pd.concat([traindf, testdf])
Then we can filter using the value of the dataset column
In [ ]:
train = df['dataset'] == 'train'
test = ~train
In [ ]:
df[train].head()
In [ ]:
df[test].head()
In [ ]:
is_num = df.dtypes != object
dfnum = df.loc[:, is_num]
dfnum.head()
In [ ]:
df.loc[:, is_num] = dfnum ** 2
df.head()
In [ ]:
df.MasVnrArea.isnull().sum()
In [ ]:
df.loc[df.MasVnrArea.isnull(), 'MasVnrArea'] = 0
In [ ]:
df.MasVnrArea.isnull().sum()
In [ ]:
sns.distplot(df.loc[train, 'GrLivArea'])
This works quite well, but it takes a bit of work to set up and requires keeping two boolean Series (train and test) around to filter the dataset whenever needed. Still, an improvement over samlib.DataSet.
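The setup can be condensed into a pair of small helpers; a sketch on toy frames (the helper names combine and split are made up for illustration):

```python
import pandas as pd

# Toy frames standing in for raw_train / raw_test.
raw_train = pd.DataFrame({'A': [1.0, 2.0]})
raw_test = pd.DataFrame({'A': [3.0]})

def combine(train, test):
    """Tag each frame with a 'dataset' column and stack them."""
    return pd.concat([train.assign(dataset='train'),
                      test.assign(dataset='test')],
                     ignore_index=True)

def split(df):
    """Recover the two frames, dropping the helper column."""
    mask = df['dataset'] == 'train'
    return (df.loc[mask].drop(columns='dataset'),
            df.loc[~mask].drop(columns='dataset'))

df = combine(raw_train, raw_test)
df['A'] = df['A'] ** 2          # transform once, both sets stay in sync
train, test = split(df)
```

This hides the bookkeeping, at the cost of carrying the extra dataset column through every operation.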
In [ ]:
panel = pd.Panel({'train':raw_train.copy(), 'test': raw_test.copy()})
In [ ]:
panel.train.head()
In [ ]:
panel.test.head()
The above is very nice, but unfortunately a Panel isn't a DataFrame, so we can't really get a view of the full data. We also seem to have lost the column dtypes (the Panel coerces everything to a common dtype):
In [ ]:
is_num = panel.train.dtypes != object
any(is_num)
So we must keep the raw data if we want to filter the numerical columns :-(
In [ ]:
is_num = raw_train.dtypes != object
numpanel = panel.loc[:, :, is_num]
In [ ]:
numpanel
In [ ]:
numpanel.train.head()
Assigning back through the indexer, however, raises an error:
In [ ]:
try:
panel.loc[:, :, is_num] = panel.loc[:, :, is_num]**2
except NotImplementedError as err:
print('raises NotImplementedError: ', err)
Looked promising initially, but it is not really workable: assigning to an indexer is not implemented for Panels yet. We really need a DataFrame object.
In [ ]:
traindf = raw_train.copy()
testdf = raw_test.copy()
df = pd.concat([traindf, testdf], keys=('train', 'test'))
df.head()
In [ ]:
df.tail()
The test and train datasets can be accessed by filtering on the outer index level. Nice, though not quite as compact as df[train]; on the other hand we don't need the extra train and test masks.
In [ ]:
df.loc['train'].head()
In [ ]:
is_num = df.dtypes != object
dfnum = df.loc[:, is_num]
dfnum.head()
In [ ]:
df.loc[:, is_num] = dfnum ** 2
df.head()
In [ ]:
df.MasVnrArea.isnull().sum()
In [ ]:
df.loc[df.MasVnrArea.isnull(), 'MasVnrArea'] = 0
In [ ]:
df.MasVnrArea.isnull().sum()
In [ ]:
sns.distplot(df.GrLivArea.train)
Another way of doing it
In [ ]:
sns.distplot(df.loc['train', 'GrLivArea'])
Works very well.
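On toy data, the whole round trip looks like this (a sketch; the notebook itself uses raw_train/raw_test loaded from the CSVs):

```python
import pandas as pd

# Toy frames standing in for raw_train / raw_test.
train = pd.DataFrame({'GrLivArea': [1000, 1500], 'MasVnrArea': [None, 200.0]})
test = pd.DataFrame({'GrLivArea': [1200], 'MasVnrArea': [150.0]})

# The keys become an outer index level tagging each row with its origin.
df = pd.concat([train, test], keys=('train', 'test'))

# One assignment fixes the nulls in both sets at once...
df.loc[df.MasVnrArea.isnull(), 'MasVnrArea'] = 0

# ...and the outer level splits them back apart.
train_part = df.loc['train']
test_part = df.loc['test']
```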
In [ ]:
traindf = raw_train.copy()
testdf = raw_test.copy()
df = pd.concat([traindf, testdf], axis=1, keys=('train','test')).swaplevel(axis=1)
df.sort_index(axis=1, inplace=True) # needed: slicing an unsorted column MultiIndex raises an error
In [ ]:
df.head()
In [ ]:
df.tail()
The test and train datasets can be accessed by selecting on the second column level. Nice, but not quite as compact as df[train], though we don't need the extra train and test masks.
In [ ]:
df.xs('train', level=1, axis=1).head() # or use IndexSlice
We must also deal with the extra index level when filtering, but it's not too bad.
In [ ]:
is_num = df.dtypes != object
dfnum = df.loc[:, is_num]
dfnum.head()
In [ ]:
df.loc[:, is_num] = dfnum ** 2
df.head()
Getting and setting nulls (without fillna) is a little tricky. Boolean indexing is, by definition, meant to work over rows, not rows *and* columns. We can use boolean frames with DataFrame.mask, though. This is definitely something to keep in mind when using a MultiIndex over columns.
In [ ]:
df.MasVnrArea = df.MasVnrArea.mask(df.MasVnrArea.isnull(), 0)
In [ ]:
df.MasVnrArea.tail()
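On a small synthetic frame with the same (feature, dataset) column layout, the mask trick looks like this:

```python
import numpy as np
import pandas as pd

# Toy frame with ('feature', 'dataset') pairs as a column MultiIndex.
cols = pd.MultiIndex.from_product([['MasVnrArea', 'GrLivArea'],
                                   ['train', 'test']])
df = pd.DataFrame([[np.nan, 100.0, 1000.0, 1200.0],
                   [50.0, np.nan, 1500.0, 900.0]], columns=cols)

# Boolean indexing is row-oriented, but mask accepts a full boolean
# frame, so it replaces nulls across every column pair in one go.
df = df.mask(df.isnull(), 0)
```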
Visualizing the training dataset is pretty easy.
In [ ]:
sns.distplot(df.GrLivArea.train)
In [ ]:
traindf = raw_train.copy()
testdf = raw_test.copy()
df = pd.concat([traindf, testdf], axis=1, keys=('train','test'))
df.head()
In [ ]:
df.tail()
The test and train datasets can be accessed by selecting on the first column level. Nice, but not quite as compact as df[train], though we don't need the extra train and test masks.
In [ ]:
df.train.head()
We must also deal with the extra index level when filtering, but it's not too bad.
In [ ]:
is_num = df.dtypes != object
dfnum = df.loc[:, is_num]
dfnum.head()
In [ ]:
df.loc[:, is_num] = dfnum ** 2
df.head()
Definitely harder to slice across columns. It's possible (unlike with Panels), but harder, as it requires pd.IndexSlice.
In [ ]:
df.loc[:, pd.IndexSlice[:, 'MasVnrArea']].isnull().sum()
You can also use a cross-section to get the data more easily, but you can't use it for assignments.
In [ ]:
df.xs('MasVnrArea', axis=1, level=1).head()
In [ ]:
df.loc[:, pd.IndexSlice[:, 'MasVnrArea']] = 0 # note: this zeroes the whole column, not just the nulls
In [ ]:
df.loc[:, pd.IndexSlice[:, 'MasVnrArea']].isnull().sum()
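To zero only the missing values, rather than the whole column, one option is to combine pd.IndexSlice with mask; a sketch on toy data:

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice
cols = pd.MultiIndex.from_product([['train', 'test'],
                                   ['MasVnrArea', 'GrLivArea']])
df = pd.DataFrame([[np.nan, 1000.0, 80.0, 1200.0],
                   [60.0, 1500.0, np.nan, 900.0]], columns=cols)
df = df.sort_index(axis=1)  # slicing an unsorted column MultiIndex raises

# Pull out the MasVnrArea columns of both sets, fill their nulls,
# then assign the result back through the same slice.
sub = df.loc[:, idx[:, 'MasVnrArea']]
df.loc[:, idx[:, 'MasVnrArea']] = sub.mask(sub.isnull(), 0)
```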
Visualizing the training dataset is pretty easy.
In [ ]:
sns.distplot(df.train.GrLivArea)
Method 1: add columns then use set_index
In [ ]:
traindf = raw_train.copy()
testdf = raw_test.copy()
traindf['Dataset'] = 'train'
testdf['Dataset'] = 'test'
df = pd.concat([traindf, testdf])
df.set_index('Dataset').head()
Method 2: use concat and droplevel
In [ ]:
traindf = raw_train.copy()
testdf = raw_test.copy()
df = pd.concat([traindf, testdf], keys=('train', 'test'))
df.index = df.index.droplevel(1)
df.head()
In [ ]:
df.tail()
The test and train datasets can be accessed using .loc.
In [ ]:
df.loc['train'].head()
Filtering columns is very easy
In [ ]:
is_num = df.dtypes != object
dfnum = df.loc[:, is_num]
dfnum.head()
In [ ]:
df.loc[:, is_num] = dfnum ** 2
df.head()
In [ ]:
df.MasVnrArea.isnull().sum()
In [ ]:
df.loc[df.MasVnrArea.isnull(), 'MasVnrArea'] = 0
In [ ]:
df.MasVnrArea.isnull().sum()
In [ ]:
sns.distplot(df.GrLivArea.train)
Another way of doing it
In [ ]:
sns.distplot(df.loc['train', 'GrLivArea'])
Works very well too: the index keeps the dataset labels test and train alongside the data.
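Putting the preferred approach together, a minimal end-to-end sketch on toy data (the notebook itself reads the prepared CSVs):

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the prepared train/test CSVs.
train = pd.DataFrame({'GrLivArea': [1000.0, 1500.0],
                      'MasVnrArea': [np.nan, 200.0],
                      'MSZoning': ['RL', 'RM']})
test = pd.DataFrame({'GrLivArea': [1200.0],
                     'MasVnrArea': [150.0],
                     'MSZoning': ['RL']})

# Stack with keys, then drop the per-frame row numbers.
df = pd.concat([train, test], keys=('train', 'test'))
df.index = df.index.droplevel(1)

# Transform all numerical columns for both sets at once.
is_num = df.dtypes != object
df.loc[:, is_num] = df.loc[:, is_num] ** 2

# Fill nulls in one shot, then split back out with .loc.
df.loc[df.MasVnrArea.isnull(), 'MasVnrArea'] = 0
train_out = df.loc['train']
```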