Applied Data Science

Data Wrangling

Niall Twomey

To get in touch, please email <firstname>.<lastname>@bristol.ac.uk

This notebook considers the CASAS dataset, which was collected in a smart environment. As participants interact with the house, sensors record their interactions. There are a number of different sensor types, including motion, door contact, light, temperature, water flow, etc. (see sensorlayout2.png).

This notebook goes through a number of common issues that arise when working with real data, namely issues relating to dates, sensor values, and so on. These are dealt with consistently using the functionality provided by the pandas library.

The objective is to fix all errors (where we can), and then to convert the timeseries data to a form that would be recognisable by a machine learning algorithm. I have attempted to comment my code where possible to explain my thought processes. At several points in this script I could have taken shortcuts, but I have deliberately favoured clarity over brevity.

For more detail on this dataset, see

  1. Cook, Diane J., and Maureen Schmitter-Edgecombe. "Assessing the quality of activities in a smart environment." Methods of information in medicine 48.5 (2009): 480.

  2. Twomey, Niall, Tom Diethe, and Peter Flach. "On the need for structure modelling in sequence prediction." Machine Learning 104.2-3 (2016): 291-314.

  3. Twomey, Niall, et al. "Unsupervised learning of sensor topologies for improving activity recognition in smart environments." Neurocomputing (2016).

  4. Diethe, Tom, Niall Twomey, and Peter Flach. "Bayesian modelling of the temporal aspects of smart home activity with circular statistics." Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer International Publishing, 2015.

  5. Twomey, Niall, and Peter Flach. "Context modulation of sensor data applied to activity recognition in smart homes." Workshop on Learning over Multiple Contexts, European Conference on Machine Learning (ECML’14). 2014.


In [1]:
## from __future__ import print_function  # uncomment if using python 2
from os.path import join 

import pandas as pd
import numpy as np
from datetime import datetime

%matplotlib inline

Set up various parameters and variables that will be used in this script


In [2]:
url = 'http://casas.wsu.edu/datasets/twor.2009.zip'
zipfile = url.split('/')[-1]
dirname = '.'.join(zipfile.split('.')[:2])
filename = join(dirname, 'data')

print('     url: {}'.format(url))
print(' zipfile: {}'.format(zipfile))
print(' dirname: {}'.format(dirname))
print('filename: {}'.format(filename))


     url: http://casas.wsu.edu/datasets/twor.2009.zip
 zipfile: twor.2009.zip
 dirname: twor.2009
filename: twor.2009/data

Download the dataset and unzip it using the following shell commands

wget http://casas.wsu.edu/datasets/twor.2009.zip
unzip twor.2009.zip

or directly in Python

from subprocess import call
call(('wget', url));
call(('unzip', zipfile));
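
Alternatively, the download and extraction can be done with the standard library alone. The following is a sketch (it assumes the URL is reachable, and needs no wget or unzip binaries):

from urllib.request import urlretrieve
from zipfile import ZipFile

urlretrieve(url, zipfile)   # download to ./twor.2009.zip
with ZipFile(zipfile) as zf:
    zf.extractall()         # creates the twor.2009/ directory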

In [3]:
#from subprocess import call
#call(('wget', url));
#call(('unzip', zipfile));

In [4]:
from IPython.display import Image
Image("twor.2009/sensorlayout2.png")


Out[4]:

In [5]:
column_headings = ('date', 'time', 'sensor', 'value', 'annotation', 'state')

df = pd.read_csv(
    filename, 
    delim_whitespace=True,  # Note, the file is delimited by both space and tab characters
    names=column_headings
)

df.head()


Out[5]:
date time sensor value annotation state
0 2009-02-02 07:15:16.575809 M35 ON R1_Bed_to_Toilet begin
1 2009-02-02 07:15:21.408519 AD1-A 2.82231 NaN NaN
2 2009-02-02 07:15:22.532789 M35 OFF NaN NaN
3 2009-02-02 07:15:23.345479 M35 ON NaN NaN
4 2009-02-02 07:15:27.529299 AD1-A 2.79926 NaN NaN

In [6]:
df.columns
#df.sensor


Out[6]:
Index(['date', 'time', 'sensor', 'value', 'annotation', 'state'], dtype='object')

In [7]:
df.dtypes


Out[7]:
date          object
time          object
sensor        object
value         object
annotation    object
state         object
dtype: object

In [8]:
df.time[0]


Out[8]:
'07:15:16.575809'

In [9]:
df['datetime'] = pd.to_datetime(df[['date', 'time']].apply(lambda row: ' '.join(row), axis=1))


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
[... pandas internals ...]
TypeError: Unrecognized value type: <class 'str'>

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-9-69f69d6e76b4> in <module>
----> 1 df['datetime'] = pd.to_datetime(df[['date', 'time']].apply(lambda row: ' '.join(row), axis=1))

[... traceback through pandas and dateutil internals elided ...]

ValueError: year 22009 is out of range

Create a datetime column (currently date and time are separate, and are both strings). The following should achieve this:

df['datetime'] = pd.to_datetime(df[['date', 'time']].apply(lambda row: ' '.join(row), axis=1))

The code above fails, however, and this is expected behaviour. If we investigate the traceback, we discover that the file contains rows dated in the year 22009. This is clearly a typo for 2009, so we will first replace these values and then parse the dates.
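
Had the traceback been less forthcoming, a defensive sketch (assuming a reasonably recent pandas) is to parse with errors='coerce', which turns unparseable values into NaT so the offending rows can be inspected directly:

stamps = pd.to_datetime(df.date + ' ' + df.time, errors='coerce')
df.loc[stamps.isna(), ['date', 'time']]  # the rows whose dates failed to parse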


In [10]:
#df.ix[df.date.str.startswith('22009'), 'date'] = '2009-02-03'
df.loc[df.date.str.startswith('22009'), 'date'] = '2009-02-03'
df['datetime'] = pd.to_datetime(df[['date', 'time']].apply(lambda row: ' '.join(row), axis=1))

In [11]:
df.dtypes


Out[11]:
date                  object
time                  object
sensor                object
value                 object
annotation            object
state                 object
datetime      datetime64[ns]
dtype: object

By default, the new column is added after the existing columns. Since the date and time are now captured by the datetime column, we no longer need the separate date and time columns. Additionally, we will see how useful it is to have the datetime column as the index: this allows us to do time-driven querying, which will be very useful for this dataset.


In [12]:
df = df[['datetime', 'sensor', 'value', 'annotation', 'state']]

df.set_index('datetime', inplace=True)

df.head()


Out[12]:
sensor value annotation state
datetime
2009-02-02 07:15:16.575809 M35 ON R1_Bed_to_Toilet begin
2009-02-02 07:15:21.408519 AD1-A 2.82231 NaN NaN
2009-02-02 07:15:22.532789 M35 OFF NaN NaN
2009-02-02 07:15:23.345479 M35 ON NaN NaN
2009-02-02 07:15:27.529299 AD1-A 2.79926 NaN NaN
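
As a brief aside, the datetime index already pays off: pandas supports partial-string indexing and label slicing on a DatetimeIndex. A sketch (the dates are illustrative):

df.loc['2009-02-02']                           # every event on 2 February
df.loc['2009-02-02 07:00':'2009-02-02 08:00']  # a one-hour window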

We can now inspect the unique sensor, activity and value values:


In [13]:
df.sensor.unique()


Out[13]:
array(['M35', 'AD1-A', 'M34', 'M32', 'M33', 'M36', 'M30', 'M29', 'M37',
       'M38', 'M39', 'M41', 'M40', 'M49', 'M46', 'M50', 'M44', 'M47',
       'M48', 'M43', 'M28', 'M42', 'M27', 'M26', 'M25', 'M24', 'M21',
       'M19', 'M18', 'M17', 'M16', 'M15', 'D09', 'M14', 'M13', 'M08',
       'M07', 'M02', 'M45', 'M51', 'D08', 'AD1-B', 'AD1-C', 'D10', 'I03',
       'M06', 'M09', 'M10', 'M01', 'D12', 'L04', 'M23', 'M31', 'D15',
       'D07', 'D14', 'M11', 'M05', 'M03', 'M04', 'M22', 'M12', 'M20',
       'L11', 'D05', 'D03', 'L10', 'L06', 'L13', 'L12', 'L09'],
      dtype=object)

In [14]:
df.annotation.unique()


Out[14]:
array(['R1_Bed_to_Toilet', nan, 'R1_Personal_Hygiene', 'R2_Bed_to_Toilet',
       'Meal_Preparation', 'R2_Personal_Hygiene', 'Watch_TV', 'R1_Work',
       'R2_Work', 'R1_Sleep', 'R2_Sleep', 'Wash_Bathtub', 'Clean',
       'Study'], dtype=object)

In [15]:
df.state.unique()


Out[15]:
array(['begin', nan, 'end'], dtype=object)

In [16]:
df.value.unique()


Out[16]:
array(['ON', '2.82231', 'OFF', ..., '0.122853', '0.101369', '2.80684'],
      dtype=object)

We can see here that the unique values contain both numbers (e.g. 2.82231) and strings (ON, OFF). This is because the data recorded by all sensors is contained in one column. The next few steps will extract the non-numeric (i.e. categorical) data from this column.

Extracting the categorical data

We can extract the categorical dataframe by noting that no categorical sensor name starts with the letter A:

The regular expression ^[^A] matches strings that do not begin with A -- the first ^ is the symbol for the start of the string, and the second ^, when contained within square brackets, is a 'not match' operation.

Since we only need the sensor name and the value, the data frames in this and the next section will slice these two columns only (ignoring the annotation columns).
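
A quick sanity check of the pattern (illustrative only):

pd.Series(['M35', 'AD1-A', 'D09']).str.match(r'^[^A]')
# -> True, False, True: only names that do not begin with 'A' match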


In [17]:
categorical_inds = df.sensor.str.match(r"^[^A]")

df_categorical = df.loc[categorical_inds][['sensor', 'value']]

df_categorical.head()


Out[17]:
sensor value
datetime
2009-02-02 07:15:16.575809 M35 ON
2009-02-02 07:15:22.532789 M35 OFF
2009-02-02 07:15:23.345479 M35 ON
2009-02-02 07:15:28.655329 M34 ON
2009-02-02 07:15:32.574290 M34 OFF

In [18]:
df_categorical.value.value_counts()


Out[18]:
ON         64854
OFF        63633
CLOSE        781
OPEN         774
ABSENT        27
PRESENT       23
OFFF           2
OF             2
O              1
ONF            1
Name: value, dtype: int64

In [19]:
# Fold the corrupted readings into 'OFF'
for val in ('O', 'OF', 'OFFF', 'ONF'):
    df_categorical.loc[df_categorical.value == val, 'value'] = 'OFF'
    
df_categorical.value.value_counts()


Out[19]:
ON         64854
OFF        63639
CLOSE        781
OPEN         774
ABSENT        27
PRESENT       23
Name: value, dtype: int64
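
The same correction can be written more compactly with Series.replace (a sketch, assuming as above that all four corrupted readings should map to 'OFF'):

df_categorical['value'] = df_categorical.value.replace(
    {'O': 'OFF', 'OF': 'OFF', 'OFFF': 'OFF', 'ONF': 'OFF'}
)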

Our ambition is to create a matrix where each column corresponds to a combination of sensor and value that is available in the data. For example, one column would correspond to the state of M35 being ON, and another column would correspond to M35 being OFF. The reason for having two columns to represent the ON and OFF states is that different information may be conveyed by each: a sensor turning on may correspond to somebody entering a room, while the same sensor turning off may correspond to somebody leaving it.

We will achieve the matrix representation by creating a new column that concatenates the sensor and value columns, and then we will use the get_dummies function provided by pandas to create the representation that we desire.
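
To preview the shape of what get_dummies produces, a toy example (illustrative only):

pd.get_dummies(pd.Series(['m35_on', 'm35_off', 'm35_on']))
#    m35_off  m35_on
# 0        0       1
# 1        1       0
# 2        0       1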


In [20]:
df_categorical.loc[:, 'sensor_value'] = df_categorical[['sensor', 'value']].apply(
    lambda row: '{}_{}'.format(*row).lower(), 
    axis=1
)

df_categorical.head()


Out[20]:
sensor value sensor_value
datetime
2009-02-02 07:15:16.575809 M35 ON m35_on
2009-02-02 07:15:22.532789 M35 OFF m35_off
2009-02-02 07:15:23.345479 M35 ON m35_on
2009-02-02 07:15:28.655329 M34 ON m34_on
2009-02-02 07:15:32.574290 M34 OFF m34_off

In [21]:
df_categorical_exploded = pd.get_dummies(df_categorical.sensor_value)

df_categorical_exploded.head()


Out[21]:
d03_close d03_open d05_open d07_close d07_open d08_close d08_open d09_close d09_open d10_close ... m47_off m47_on m48_off m48_on m49_off m49_on m50_off m50_on m51_off m51_on
datetime
2009-02-02 07:15:16.575809 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2009-02-02 07:15:22.532789 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2009-02-02 07:15:23.345479 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2009-02-02 07:15:28.655329 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2009-02-02 07:15:32.574290 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 129 columns

And if desired, we can get a matrix form of the data with the values property


In [22]:
df_categorical_exploded.values


Out[22]:
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)

In [23]:
df_categorical_exploded['m35_off'].plot(figsize=(10,5));



In [24]:
kitchen_columns = ['m{}_on'.format(ii) for ii in (15,16,17,18,19,51)] 

start = datetime(2009, 2, 2, 10) 
end = datetime(2009, 2, 2, 11) 

df_categorical_exploded[(df_categorical_exploded.index > start) & (df_categorical_exploded.index < end)][kitchen_columns].plot(figsize=(10,15), subplots=True);



In [25]:
start = datetime(2009, 2, 2, 15) 
end = datetime(2009, 2, 2, 17) 

df_categorical_exploded[(df_categorical_exploded.index > start) & (df_categorical_exploded.index < end)][kitchen_columns].plot(figsize=(10,15), subplots=True);
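
Equivalently (a sketch), the boolean masks can be replaced by label slicing on the DatetimeIndex; note that .loc slices include both endpoints, unlike the strict inequalities above:

df_categorical_exploded.loc[start:end, kitchen_columns].plot(figsize=(10,15), subplots=True);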


Numeric columns

We have extracted a matrix representation of the categorical data. Now we will do the same for the numeric data.


In [26]:
numeric_inds = df.sensor.str.startswith("A")

df_numeric = df.loc[numeric_inds][['sensor', 'value']]

df_numeric.head()


Out[26]:
sensor value
datetime
2009-02-02 07:15:21.408519 AD1-A 2.82231
2009-02-02 07:15:27.529299 AD1-A 2.79926
2009-02-02 07:48:10.647809 AD1-A 2.82754
2009-02-02 07:48:16.658869 AD1-A 2.7977
2009-02-02 07:49:47.980269 AD1-A 2.81895

In [27]:
np.asarray(df_numeric.value)


Out[27]:
array(['2.82231', '2.79926', '2.82754', ..., '2.80207', '2.82731',
       '2.80684'], dtype=object)

In [28]:
df_numeric.value.astype(float)


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-28-c87b55486268> in <module>
----> 1 df_numeric.value.astype(float)

[... traceback through pandas internals elided ...]

ValueError: could not convert string to float: '0.509695F'

Note, however, that since the value data was read from file, it is still in string format. We can convert these str data types to floating point easily as follows:

df_numeric.value.map(float)

However, if we do this, we discover that one record holds a suspect value (0.509695F). This is clearly another mistake in the dataset. We can remove the F from the string easily. It would not be difficult to make the next bit of code more robust to other malformed values (e.g. by applying regular expressions to the strings), but here we will simply slice up to the last character to remove the F:
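
Such a regex-based variant might look like this (a sketch, shown for reference but not used below):

# Keep only the leading numeric part of each string, then convert to float
cleaned = df_numeric.value.str.extract(r'([-+]?\d*\.?\d+)', expand=False).astype(float)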


In [29]:
f_inds = df_numeric.value.str.endswith('F')

df_numeric.loc[f_inds, 'value'] = df_numeric.loc[f_inds, 'value'].str[:-1]

df_numeric.loc[f_inds]


Out[29]:
sensor value
datetime
2009-02-19 18:35:43.447570 AD1-B 0.509695

We can now map all data to floating point numbers


In [30]:
df_numeric.value = df_numeric.value.map(float)

There are only three numeric sensor types, as we can see with the unique method:


In [31]:
unique_keys = df_numeric.sensor.unique()

unique_keys


Out[31]:
array(['AD1-A', 'AD1-B', 'AD1-C'], dtype=object)

Create some new columns for the three sensors (AD1-A, AD1-B, and AD1-C), and merge with the original data frame


In [32]:
df_numeric = pd.merge(df_numeric[['value']], pd.get_dummies(df_numeric.sensor), left_index=True, right_index=True)

df_numeric.head()


Out[32]:
value AD1-A AD1-B AD1-C
datetime
2009-02-02 07:15:21.408519 2.82231 1 0 0
2009-02-02 07:15:27.529299 2.79926 1 0 0
2009-02-02 07:48:10.647809 2.82754 1 0 0
2009-02-02 07:48:16.658869 2.79770 1 0 0
2009-02-02 07:49:47.980269 2.81895 1 0 0

In [33]:
for key in unique_keys:
    df_numeric[key] *= df_numeric.value

df_numeric = df_numeric[unique_keys]

# Print a larger sample of the data frame
df_numeric


Out[33]:
AD1-A AD1-B AD1-C
datetime
2009-02-02 07:15:21.408519 2.82231 0.0 0.0
2009-02-02 07:15:27.529299 2.79926 0.0 0.0
2009-02-02 07:48:10.647809 2.82754 0.0 0.0
2009-02-02 07:48:16.658869 2.79770 0.0 0.0
2009-02-02 07:49:47.980269 2.81895 0.0 0.0
... ... ... ...
2009-04-03 18:13:07.618079 2.79504 0.0 0.0
2009-04-03 18:13:45.458260 2.82231 0.0 0.0
2009-04-03 18:13:48.585030 2.80207 0.0 0.0
2009-04-03 18:14:24.679579 2.82731 0.0 0.0
2009-04-03 18:14:30.627379 2.80684 0.0 0.0

7693 rows × 3 columns
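
For reference, the same wide layout could have been produced in one step with pivot. A sketch only: df_numeric here refers to the sensor/value frame of In [26], before the merge above, and DataFrame.pivot raises on duplicated index entries, so this assumes the numeric timestamps are unique:

df_numeric.pivot(columns='sensor', values='value').fillna(0.0)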


In [34]:
#df_numeric.value.groupby(df_numeric.sensor).plot(kind='kde', legend=True, figsize=(10,5)) 
df_numeric[unique_keys].plot(kind='kde', legend=True, figsize=(10,5), subplots=True);


Merging categorical and numeric data together

We now have two dataframes: one for categorical data and one for numeric data. For any analysis, it would be useful to have these in a unified data frame.

First we will remind ourselves what the dataframes look like, and then these will be unified into one format.


In [35]:
df_categorical_exploded.head()


Out[35]:
d03_close d03_open d05_open d07_close d07_open d08_close d08_open d09_close d09_open d10_close ... m47_off m47_on m48_off m48_on m49_off m49_on m50_off m50_on m51_off m51_on
datetime
2009-02-02 07:15:16.575809 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2009-02-02 07:15:22.532789 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2009-02-02 07:15:23.345479 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2009-02-02 07:15:28.655329 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2009-02-02 07:15:32.574290 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 129 columns


In [36]:
df_numeric.head()


Out[36]:
AD1-A AD1-B AD1-C
datetime
2009-02-02 07:15:21.408519 2.82231 0.0 0.0
2009-02-02 07:15:27.529299 2.79926 0.0 0.0
2009-02-02 07:48:10.647809 2.82754 0.0 0.0
2009-02-02 07:48:16.658869 2.79770 0.0 0.0
2009-02-02 07:49:47.980269 2.81895 0.0 0.0

We will use the pandas.merge function to join the two dataframes, this time using more of its functionality. We will merge on the indices of the categorical and numeric dataframes. However, since none of the timestamps are shared between the two (refer to the original data frame), we must merge with an "outer" join.
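
An equivalent spelling (a sketch) uses DataFrame.join, which aligns on the index by default:

df_joined = df_categorical_exploded.join(df_numeric, how='outer')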


In [37]:
df_joined = pd.merge(
    df_categorical_exploded, 
    df_numeric, 
    left_index=True, 
    right_index=True,
    how='outer'
)

df_joined.head()


Out[37]:
d03_close d03_open d05_open d07_close d07_open d08_close d08_open d09_close d09_open d10_close ... m48_on m49_off m49_on m50_off m50_on m51_off m51_on AD1-A AD1-B AD1-C
datetime
2009-02-02 07:15:16.575809 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN
2009-02-02 07:15:21.408519 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 2.82231 0.0 0.0
2009-02-02 07:15:22.532789 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN
2009-02-02 07:15:23.345479 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN
2009-02-02 07:15:27.529299 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 2.79926 0.0 0.0

5 rows × 132 columns

Note that, in merging the dataframes, we now have a time-ordered dataframe. This is one of the advantages of using datetimes as the index type: pandas understands precisely how to align the two datasets.

Annotations

So far we have extracted the categorical and the numeric data from the dataframe. The annotations/labels have not yet been considered. This section will focus on these data.

Since this part of the notebook will focus only on the annotations, the annotation dataframe will slice these columns exclusively.

Finally, because the annotations have begin and end times, we will need to fill in the time between begin and end with 1's. The ideal output is therefore a matrix in which each column corresponds to a particular activity, with the duration of the activity captured by a run of 1's. This is similar to the technique used to create the categorical dataframe, except that there we only recorded the instants at which a sensor turned on or off, and did not fill in its continuous state.


In [38]:
annotation_inds = pd.notnull(df.annotation)

df_annotation = df.loc[annotation_inds][['annotation', 'state']]

# There are some duplicated timestamps in the index; keep the first of each
df_annotation = df_annotation.groupby(level=0).first()

df_annotation.head()


Out[38]:
annotation state
datetime
2009-02-02 07:15:16.575809 R1_Bed_to_Toilet begin
2009-02-02 07:21:03.792759 R1_Bed_to_Toilet end
2009-02-02 07:21:04.014729 R1_Personal_Hygiene begin
2009-02-02 07:38:33.946559 R1_Personal_Hygiene end
2009-02-02 07:44:12.597919 R2_Bed_to_Toilet begin

It's important to ensure that the format of the data is consistent throughout the dataset. In particular, every annotation should have an equal number of begin and end states.

One way to check this is to group by the annotation label and print some statistics about the data.


In [39]:
for annotation, group in df_annotation.groupby('annotation'): 
    counts = group.state.value_counts()
    
    if counts.begin == counts.end: 
        print('             {}: equal counts ({}, {})'.format(
            annotation, 
            counts.begin, 
            counts.end
        ))
        
    else:
        print(' *** WARNING {}: inconsistent annotation counts with {} begin and {} end'.format(
            annotation, 
            counts.begin, 
            counts.end
        ))


             Clean: equal counts (2, 2)
             Meal_Preparation: equal counts (118, 118)
             R1_Bed_to_Toilet: equal counts (34, 34)
 *** WARNING R1_Personal_Hygiene: inconsistent annotation counts with 44 begin and 45 end
             R1_Sleep: equal counts (35, 35)
 *** WARNING R1_Work: inconsistent annotation counts with 59 begin and 60 end
             R2_Bed_to_Toilet: equal counts (39, 39)
 *** WARNING R2_Personal_Hygiene: inconsistent annotation counts with 38 begin and 39 end
             R2_Sleep: equal counts (34, 34)
             R2_Work: equal counts (44, 44)
             Study: equal counts (9, 9)
             Wash_Bathtub: equal counts (1, 1)
             Watch_TV: equal counts (31, 31)

We can see here that three activities have inconsistent numbers of begin and end statements. Interestingly, they all have more end statements than begin statements; in some sense, this is a less critical bug than having more begin statements.

In order to deal with this, we will first look at the affected rows:


In [40]:
df_annotation.loc[df_annotation.annotation == 'R1_Work']


Out[40]:
annotation state
datetime
2009-02-02 14:29:03.195450 R1_Work begin
2009-02-02 17:37:22.211110 R1_Work end
2009-02-02 19:00:18.038599 R1_Work begin
2009-02-02 19:30:06.534640 R1_Work end
2009-02-03 17:57:46.718869 R1_Work begin
... ... ...
2009-04-02 17:51:11.816370 R1_Work begin
2009-04-02 19:59:07.909409 R1_Work end
2009-04-02 22:28:09.071290 R1_Work end
2009-04-03 19:33:45.398900 R1_Work begin
2009-04-03 20:40:53.145679 R1_Work end

119 rows × 2 columns

Querying consecutive annotations, we can keep only the pairs in which a begin is immediately followed by an end, discarding any unmatched statements:


In [41]:
def filter_annotations(anns):
    # Walk over consecutive pairs of rows, keeping only those where a
    # 'begin' is immediately followed by an 'end'; unmatched rows are dropped
    left = anns.index[:-1]
    right = anns.index[1:]

    inds = []
    for ll, rr in zip(left, right):
        l = anns.loc[ll]
        r = anns.loc[rr]

        if l.state == 'begin' and r.state == 'end':
            inds.extend([ll, rr])

    return anns.loc[inds, :]
        

dfs = []
for annotation, group in df_annotation.groupby('annotation'): 
    print('{:>30} - {}'.format(annotation, group.size))
    dfs.append(filter_annotations(group))


                         Clean - 8
              Meal_Preparation - 472
              R1_Bed_to_Toilet - 136
           R1_Personal_Hygiene - 178
                      R1_Sleep - 140
                       R1_Work - 238
              R2_Bed_to_Toilet - 156
           R2_Personal_Hygiene - 154
                      R2_Sleep - 136
                       R2_Work - 176
                         Study - 36
                  Wash_Bathtub - 4
                      Watch_TV - 124

Create the output dataframe


In [42]:
df_annotation_exploded = pd.get_dummies(df_annotation.annotation)

df_annotation_exploded.head(50)


Out[42]:
Clean Meal_Preparation R1_Bed_to_Toilet R1_Personal_Hygiene R1_Sleep R1_Work R2_Bed_to_Toilet R2_Personal_Hygiene R2_Sleep R2_Work Study Wash_Bathtub Watch_TV
datetime
2009-02-02 07:15:16.575809 0 0 1 0 0 0 0 0 0 0 0 0 0
2009-02-02 07:21:03.792759 0 0 1 0 0 0 0 0 0 0 0 0 0
2009-02-02 07:21:04.014729 0 0 0 1 0 0 0 0 0 0 0 0 0
2009-02-02 07:38:33.946559 0 0 0 1 0 0 0 0 0 0 0 0 0
2009-02-02 07:44:12.597919 0 0 0 0 0 0 1 0 0 0 0 0 0
2009-02-02 07:46:55.137900 0 0 0 0 0 0 1 0 0 0 0 0 0
2009-02-02 07:47:19.239909 0 1 0 0 0 0 0 0 0 0 0 0 0
2009-02-02 07:55:23.862210 0 1 0 0 0 0 0 0 0 0 0 0 0
2009-02-02 10:19:49.541500 0 1 0 0 0 0 0 0 0 0 0 0 0
2009-02-02 10:41:52.005729 0 1 0 0 0 0 0 0 0 0 0 0 0
2009-02-02 10:51:07.093420 0 0 0 0 0 0 0 1 0 0 0 0 0
2009-02-02 11:17:06.621900 0 0 0 0 0 0 0 1 0 0 0 0 0
2009-02-02 11:41:34.140719 0 0 0 0 0 0 0 0 0 0 0 0 1
2009-02-02 12:31:01.550129 0 0 0 0 0 0 0 0 0 0 0 0 1
2009-02-02 14:29:03.195450 0 0 0 0 0 1 0 0 0 0 0 0 0
2009-02-02 15:23:23.042079 0 0 0 0 0 0 0 0 0 1 0 0 0
2009-02-02 17:36:44.601619 0 0 0 0 0 0 0 0 0 1 0 0 0
2009-02-02 17:37:22.211110 0 0 0 0 0 1 0 0 0 0 0 0 0
2009-02-02 17:47:19.486150 0 1 0 0 0 0 0 0 0 0 0 0 0
2009-02-02 18:56:56.361779 0 1 0 0 0 0 0 0 0 0 0 0 0
2009-02-02 19:00:18.038599 0 0 0 0 0 1 0 0 0 0 0 0 0
2009-02-02 19:30:06.534640 0 0 0 0 0 1 0 0 0 0 0 0 0
2009-02-02 22:10:55.426609 0 0 0 0 1 0 0 0 0 0 0 0 0
2009-02-02 22:24:18.955319 0 0 0 0 0 0 0 1 0 0 0 0 0
2009-02-02 22:26:56.912709 0 0 0 0 0 0 0 1 0 0 0 0 0
2009-02-02 22:29:18.643480 0 0 0 0 0 0 0 0 1 0 0 0 0
2009-02-03 05:57:22.066809 0 0 0 0 1 0 0 0 0 0 0 0 0
2009-02-03 06:12:12.996740 0 0 0 0 0 0 0 0 1 0 0 0 0
2009-02-03 06:12:13.631340 0 0 0 0 0 0 1 0 0 0 0 0 0
2009-02-03 06:18:23.875380 0 0 0 0 0 0 1 0 0 0 0 0 0
2009-02-03 06:32:54.238749 0 0 0 0 0 0 0 1 0 0 0 0 0
2009-02-03 07:07:53.764689 0 0 0 0 0 0 0 1 0 0 0 0 0
2009-02-03 07:29:33.824249 0 0 1 0 0 0 0 0 0 0 0 0 0
2009-02-03 07:35:47.606040 0 0 1 0 0 0 0 0 0 0 0 0 0
2009-02-03 08:06:10.968039 0 1 0 0 0 0 0 0 0 0 0 0 0
2009-02-03 17:57:36.991650 0 1 0 0 0 0 0 0 0 0 0 0 0
2009-02-03 17:57:46.718869 0 0 0 0 0 1 0 0 0 0 0 0 0
2009-02-03 18:01:23.231290 0 0 0 0 0 0 0 0 0 0 0 0 1
2009-02-03 19:10:13.150899 0 0 0 0 0 1 0 0 0 0 0 0 0
2009-02-03 19:10:23.019589 0 0 0 0 0 0 0 0 0 0 0 0 1
2009-02-03 22:25:44.373600 0 0 0 0 1 0 0 0 0 0 0 0 0
2009-02-03 22:37:34.732320 0 0 0 0 0 0 0 0 1 0 0 0 0
2009-02-04 05:56:31.889199 0 0 0 0 1 0 0 0 0 0 0 0 0
2009-02-04 06:07:12.928620 0 0 0 0 0 0 0 0 1 0 0 0 0
2009-02-04 06:12:34.172840 0 0 0 0 0 0 1 0 0 0 0 0 0
2009-02-04 06:16:41.641669 0 0 0 0 0 0 1 0 0 0 0 0 0
2009-02-04 06:17:29.700360 0 0 0 0 0 0 0 1 0 0 0 0 0
2009-02-04 06:51:49.793200 0 0 0 0 0 0 0 1 0 0 0 0 0
2009-02-04 07:01:27.414659 0 1 0 0 0 0 0 0 0 0 0 0 0
2009-02-04 07:26:00.365910 0 0 1 0 0 0 0 0 0 0 0 0 0

In [43]:
paired = pd.concat(dfs)

left = paired.index[:-1:2]
right = paired.index[1::2]

print(df_annotation_exploded.mean())
for ll, rr in zip(left, right): 
    l = paired.loc[ll]
    r = paired.loc[rr]
    
    assert l.annotation == r.annotation
    
    annotation = l.annotation
    begin = l.name
    end = r.name
    
    # Another advantage of using datetime index: can slice with time ranges
    df_annotation_exploded.loc[begin:end, annotation] = 1


Clean                  0.004086
Meal_Preparation       0.241062
R1_Bed_to_Toilet       0.069459
R1_Personal_Hygiene    0.090909
R1_Sleep               0.071502
R1_Work                0.121553
R2_Bed_to_Toilet       0.079673
R2_Personal_Hygiene    0.078652
R2_Sleep               0.069459
R2_Work                0.089888
Study                  0.018386
Wash_Bathtub           0.002043
Watch_TV               0.063330
dtype: float64

In [44]:
df_annotation_exploded.head(50)


Out[44]:
Clean Meal_Preparation R1_Bed_to_Toilet R1_Personal_Hygiene R1_Sleep R1_Work R2_Bed_to_Toilet R2_Personal_Hygiene R2_Sleep R2_Work Study Wash_Bathtub Watch_TV
datetime
2009-02-02 07:15:16.575809 0 0 1 0 0 0 0 0 0 0 0 0 0
2009-02-02 07:21:03.792759 0 0 1 0 0 0 0 0 0 0 0 0 0
2009-02-02 07:21:04.014729 0 0 0 1 0 0 0 0 0 0 0 0 0
2009-02-02 07:38:33.946559 0 0 0 1 0 0 0 0 0 0 0 0 0
2009-02-02 07:44:12.597919 0 0 0 0 0 0 1 0 0 0 0 0 0
2009-02-02 07:46:55.137900 0 0 0 0 0 0 1 0 0 0 0 0 0
2009-02-02 07:47:19.239909 0 1 0 0 0 0 0 0 0 0 0 0 0
2009-02-02 07:55:23.862210 0 1 0 0 0 0 0 0 0 0 0 0 0
2009-02-02 10:19:49.541500 0 1 0 0 0 0 0 0 0 0 0 0 0
2009-02-02 10:41:52.005729 0 1 0 0 0 0 0 0 0 0 0 0 0
2009-02-02 10:51:07.093420 0 0 0 0 0 0 0 1 0 0 0 0 0
2009-02-02 11:17:06.621900 0 0 0 0 0 0 0 1 0 0 0 0 0
2009-02-02 11:41:34.140719 0 0 0 0 0 0 0 0 0 0 0 0 1
2009-02-02 12:31:01.550129 0 0 0 0 0 0 0 0 0 0 0 0 1
2009-02-02 14:29:03.195450 0 0 0 0 0 1 0 0 0 0 0 0 0
2009-02-02 15:23:23.042079 0 0 0 0 0 1 0 0 0 1 0 0 0
2009-02-02 17:36:44.601619 0 0 0 0 0 1 0 0 0 1 0 0 0
2009-02-02 17:37:22.211110 0 0 0 0 0 1 0 0 0 0 0 0 0
2009-02-02 17:47:19.486150 0 1 0 0 0 0 0 0 0 0 0 0 0
2009-02-02 18:56:56.361779 0 1 0 0 0 0 0 0 0 0 0 0 0
2009-02-02 19:00:18.038599 0 0 0 0 0 1 0 0 0 0 0 0 0
2009-02-02 19:30:06.534640 0 0 0 0 0 1 0 0 0 0 0 0 0
2009-02-02 22:10:55.426609 0 0 0 0 1 0 0 0 0 0 0 0 0
2009-02-02 22:24:18.955319 0 0 0 0 1 0 0 1 0 0 0 0 0
2009-02-02 22:26:56.912709 0 0 0 0 1 0 0 1 0 0 0 0 0
2009-02-02 22:29:18.643480 0 0 0 0 1 0 0 0 1 0 0 0 0
2009-02-03 05:57:22.066809 0 0 0 0 1 0 0 0 1 0 0 0 0
2009-02-03 06:12:12.996740 0 0 0 0 0 0 0 0 1 0 0 0 0
2009-02-03 06:12:13.631340 0 0 0 0 0 0 1 0 0 0 0 0 0
2009-02-03 06:18:23.875380 0 0 0 0 0 0 1 0 0 0 0 0 0
2009-02-03 06:32:54.238749 0 0 0 0 0 0 0 1 0 0 0 0 0
2009-02-03 07:07:53.764689 0 0 0 0 0 0 0 1 0 0 0 0 0
2009-02-03 07:29:33.824249 0 0 1 0 0 0 0 0 0 0 0 0 0
2009-02-03 07:35:47.606040 0 0 1 0 0 0 0 0 0 0 0 0 0
2009-02-03 08:06:10.968039 0 1 0 0 0 0 0 0 0 0 0 0 0
2009-02-03 17:57:36.991650 0 1 0 0 0 0 0 0 0 0 0 0 0
2009-02-03 17:57:46.718869 0 0 0 0 0 1 0 0 0 0 0 0 0
2009-02-03 18:01:23.231290 0 0 0 0 0 1 0 0 0 0 0 0 1
2009-02-03 19:10:13.150899 0 0 0 0 0 1 0 0 0 0 0 0 1
2009-02-03 19:10:23.019589 0 0 0 0 0 0 0 0 0 0 0 0 1
2009-02-03 22:25:44.373600 0 0 0 0 1 0 0 0 0 0 0 0 0
2009-02-03 22:37:34.732320 0 0 0 0 1 0 0 0 1 0 0 0 0
2009-02-04 05:56:31.889199 0 0 0 0 1 0 0 0 1 0 0 0 0
2009-02-04 06:07:12.928620 0 0 0 0 0 0 0 0 1 0 0 0 0
2009-02-04 06:12:34.172840 0 0 0 0 0 0 1 0 0 0 0 0 0
2009-02-04 06:16:41.641669 0 0 0 0 0 0 1 0 0 0 0 0 0
2009-02-04 06:17:29.700360 0 0 0 0 0 0 0 1 0 0 0 0 0
2009-02-04 06:51:49.793200 0 0 0 0 0 0 0 1 0 0 0 0 0
2009-02-04 07:01:27.414659 0 1 0 0 0 0 0 0 0 0 0 0 0
2009-02-04 07:26:00.365910 0 1 1 0 0 0 0 0 0 0 0 0 0

Merging the full dataset


In [45]:
dataset = pd.merge(
    df_joined, 
    df_annotation_exploded, 
    left_index=True, 
    right_index=True, 
    how='outer'
)

data_cols = df_joined.columns
annotation_cols = df_annotation_exploded.columns
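
The two column groups need different treatment for the gaps introduced by the outer join: sensor columns are filled with 0 (no event occurred at that instant), while the annotation columns are forward-filled so that an activity remains marked between its begin and end rows.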

In [46]:
dataset[data_cols] = dataset[data_cols].fillna(0)
dataset[annotation_cols] = dataset[annotation_cols].ffill()

dataset.head()


Out[46]:
d03_close d03_open d05_open d07_close d07_open d08_close d08_open d09_close d09_open d10_close ... R1_Personal_Hygiene R1_Sleep R1_Work R2_Bed_to_Toilet R2_Personal_Hygiene R2_Sleep R2_Work Study Wash_Bathtub Watch_TV
datetime
2009-02-02 07:15:16.575809 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2009-02-02 07:15:21.408519 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2009-02-02 07:15:22.532789 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2009-02-02 07:15:23.345479 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2009-02-02 07:15:27.529299 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 145 columns


In [47]:
dataset[data_cols].head()


Out[47]:
d03_close d03_open d05_open d07_close d07_open d08_close d08_open d09_close d09_open d10_close ... m48_on m49_off m49_on m50_off m50_on m51_off m51_on AD1-A AD1-B AD1-C
datetime
2009-02-02 07:15:16.575809 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0
2009-02-02 07:15:21.408519 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.82231 0.0 0.0
2009-02-02 07:15:22.532789 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0
2009-02-02 07:15:23.345479 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0
2009-02-02 07:15:27.529299 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.79926 0.0 0.0

5 rows × 132 columns


In [48]:
dataset[annotation_cols].head()


Out[48]:
Clean Meal_Preparation R1_Bed_to_Toilet R1_Personal_Hygiene R1_Sleep R1_Work R2_Bed_to_Toilet R2_Personal_Hygiene R2_Sleep R2_Work Study Wash_Bathtub Watch_TV
datetime
2009-02-02 07:15:16.575809 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2009-02-02 07:15:21.408519 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2009-02-02 07:15:22.532789 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2009-02-02 07:15:23.345479 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2009-02-02 07:15:27.529299 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

In [49]:
dataset.loc[dataset.Meal_Preparation == 1][kitchen_columns + ['AD1-A']].head()


Out[49]:
m15_on m16_on m17_on m18_on m19_on m51_on AD1-A
datetime
2009-02-02 07:47:19.239909 0.0 0.0 0.0 0.0 1.0 0.0 0.0
2009-02-02 07:47:20.857180 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2009-02-02 07:47:21.147439 0.0 0.0 0.0 1.0 0.0 0.0 0.0
2009-02-02 07:47:21.426199 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2009-02-02 07:47:24.357180 0.0 0.0 0.0 0.0 0.0 0.0 0.0
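
Finally, recalling the objective stated at the outset, the unified dataframe now drops straight into the matrix form expected by a machine learning algorithm. A sketch (the choice of Meal_Preparation as the target is purely illustrative):

X = dataset[data_cols].values           # feature matrix, one row per event
y = dataset['Meal_Preparation'].values  # one binary target among several
print(X.shape, y.shape)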
