Scikit-Data Widget Introduction

The Scikit-Data library offers a set of functionalities to help data analysts in their work.

Initially, it offers just a small set of simple functionalities, such as converting a dataframe into a crosstab dataframe based on specific fields.
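To make the idea of a cross tabulation concrete, here is a minimal pure-Python sketch. It is not Scikit-Data's own implementation: the crosstab helper and the sample rows are hypothetical and only illustrate what such a table counts. With pandas, pandas.crosstab computes the same kind of table from dataframe columns.

```python
from collections import Counter


def crosstab(rows, row_field, col_field):
    """Count how often each (row_field, col_field) value pair occurs."""
    counts = Counter((r[row_field], r[col_field]) for r in rows)
    row_keys = sorted({key[0] for key in counts})
    col_keys = sorted({key[1] for key in counts})
    # Build a nested dict: one entry per row value, one count per column value.
    return {
        rk: {ck: counts.get((rk, ck), 0) for ck in col_keys}
        for rk in row_keys
    }


# Hypothetical sample records, shaped like rows of the Titanic data.
rows = [
    {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 0},
    {"Sex": "male", "Survived": 0},
    {"Sex": "male", "Survived": 1},
    {"Sex": "male", "Survived": 0},
]

table = crosstab(rows, "Sex", "Survived")
for sex, by_outcome in table.items():
    print(sex, by_outcome)
```

Each row of the result corresponds to one value of the first field, and each inner entry counts the records that share a value of the second field.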

Another interesting functionality is a Jupyter widget that provides interactive options for handling the data, with graphical and tabular outputs.

To import the Scikit-Data Jupyter widget, use the following code:

from skdata.widgets import SkDataWidget

In [1]:
from IPython.display import Image
from skdata.widgets import SkDataWidget
from skdata import SkData


/home/xmn/miniconda3/envs/skdata/lib/python3.6/site-packages/odo/backends/pandas.py:102: FutureWarning: pandas.tslib is deprecated and will be removed in a future version.
You can access NaTType as type(pandas.NaT)
  @convert.register((pd.Timestamp, pd.Timedelta), (pd.tslib.NaTType, type(None)))

Load data for analysis and visualization

The data used in this example comes from the Kaggle Titanic challenge.

Variables description:

  • survival Survival (0 = No; 1 = Yes)
  • pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
  • name Name
  • sex Sex
  • age Age
  • sibsp Number of Siblings/Spouses Aboard
  • parch Number of Parents/Children Aboard
  • ticket Ticket Number
  • fare Passenger Fare
  • cabin Cabin
  • embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES: Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower.

Age is in years; fractional if the age is less than one (1). If the age is estimated, it is in the form xx.5.

With respect to the family relation variables (i.e. sibsp and parch), some relations were ignored. The following are the definitions used for sibsp and parch.

  • Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
  • Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
  • Parent: Mother or Father of Passenger Aboard Titanic
  • Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, therefore parch=0 for them. As well, some travelled with very close friends or neighbors in a village, however, the definitions do not support such relations.


In [2]:
sd = SkData('/tmp/titanic.h5')
sd.import_from(
    source='../data/train.csv', index_col='PassengerId',
    target_col='Survived'
)

Widget

To use the SkDataWidget class, you first need a loaded SkData object:

w = SkDataWidget(sd)

You can use the display method to set the parameters of the chart, which shows a cross tabulation of the selected fields:

w.display(dset_id='dset_id')

This method uses the given parameters to create and show a chart and a data table.


In [3]:
sd['train'].summary()


Out[3]:
          Types    Set Values                                         Count Set  # Observations  # NaN
Survived  int64    [0, 1]                                             2          891             0
Pclass    int64    [1, 2, 3]                                          3          891             0
Name      object   ['Abbing, Mr. Anthony', 'Abbott, Mr. Rossmore ...  891        891             0
Sex       object   ['female', 'male']                                 2          891             0
Age       float64  [0.42, 0.67, 0.75, 0.83, 0.92, 1.0, 2.0, 3.0, ...  88         714             177
SibSp     int64    [0, 1, 2, 3, 4, 5, 8]                              7          891             0
Parch     int64    [0, 1, 2, 3, 4, 5, 6]                              7          891             0
Ticket    object   ['110152', '110413', '110465', '110564', '1108...  681        891             0
Fare      float64  [0.0, 4.0125, 5.0, 6.2375, 6.4375, 6.45, 6.495...  248        891             0
Cabin     object   ['A10', 'A14', 'A16', 'A19', 'A20', 'A23', 'A2...  147        204             687
Embarked  object   ['C', 'Q', 'S']                                    3          889             2
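The columns of this summary can be read as follows. The numbers suggest that # Observations counts the non-missing values, since for Age 714 observations plus 177 NaN gives the 891 rows of the dataset. The sketch below reproduces that reading for a single column in plain Python. It is an illustration under that assumption, not the code behind summary(); the summarize helper and the sample column are hypothetical.

```python
import math


def summarize(column):
    """Return distinct values, their count, non-missing count and missing count."""

    def is_missing(value):
        # Treat both None and float NaN as missing.
        return value is None or (isinstance(value, float) and math.isnan(value))

    present = [v for v in column if not is_missing(v)]
    distinct = sorted(set(present))
    return {
        "Set Values": distinct,
        "Count Set": len(distinct),
        "# Observations": len(present),
        "# NaN": len(column) - len(present),
    }


# Hypothetical column with two missing entries.
embarked = ["S", "C", "S", None, "Q", "S", float("nan")]
summary = summarize(embarked)
for label, value in summary.items():
    print(label, value)
```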

In [4]:
w = SkDataWidget(sd)
w.display(dset_id='train')


This should display the following screen:


In [5]:
Image(filename='../data/img/initial_screen.png')


Out[5]:

If you want to see the chart, click the Chart option and you will see something like this:


In [6]:
Image(filename='../data/img/chart_screen.png')


Out[6]:

By default, the chart is displayed by crossing each X field with Y separately (chart type = individual). If you want a single chart with all selected X fields crossed with the Y field, select the chart type option grouped.
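The difference between the two chart types can be sketched in terms of the counts behind them. This is one possible reading of the two options, not the widget's actual code: the cross helper and the sample rows are hypothetical, and the assumption is that "individual" builds one cross tabulation per X field while "grouped" combines all X fields into a single one.

```python
from collections import Counter


def cross(rows, x_fields, y_field):
    """Count occurrences of each (tuple of X values, Y value) pair."""
    return Counter(
        (tuple(r[x] for x in x_fields), r[y_field]) for r in rows
    )


# Hypothetical sample records, shaped like rows of the Titanic data.
rows = [
    {"Sex": "female", "Pclass": 1, "Survived": 1},
    {"Sex": "female", "Pclass": 3, "Survived": 0},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
    {"Sex": "male", "Pclass": 1, "Survived": 1},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
]
x_fields = ["Sex", "Pclass"]

# "individual": one table per X field, each crossed with Y on its own.
individual = {x: cross(rows, [x], "Survived") for x in x_fields}

# "grouped": a single table where all X fields are combined, then crossed with Y.
grouped = cross(rows, x_fields, "Survived")

print(individual["Sex"])
print(grouped)
```

Under this reading, the individual option yields as many tables as there are selected X fields, while the grouped option yields one table keyed by every combination of X values.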

Conclusion

These are some initial functionalities to help you handle and observe data phenomena quickly.