According to Wikipedia, feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, and is both difficult and expensive.
This feature engineering notebook loads the data sets created in the Data Ingestion notebook (Code/1_data_ingestion.ipynb) from an Azure storage container and combines them into a single data set of features (variables) that can be used to infer a machine's health condition over time. The notebook steps through several feature engineering and labeling methods to create this data set for use in our predictive maintenance machine learning solution.
Note: This notebook will take about 20-30 minutes to execute all cells, depending on the compute configuration you have set up.
In [1]:
## Setup our environment by importing required libraries
import time
import os
import glob
# Read csv file from URL directly
import pandas as pd
# For creating some preliminary EDA plots.
%matplotlib inline
import matplotlib.pyplot as plt
from ggplot import *
import datetime
from pyspark.sql.functions import to_date
import pyspark.sql.functions as F
from pyspark.sql.functions import col, unix_timestamp, round
from pyspark.sql.functions import datediff
from pyspark.sql.window import Window
from pyspark.sql.types import DoubleType
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.feature import StringIndexer
from pyspark.sql import SparkSession
# For Azure blob storage access
from azure.storage.blob import BlockBlobService
from azure.storage.blob import PublicAccess
# For logging model evaluation parameters back into the
# AML Workbench run history plots.
import logging
from azureml.logging import get_azureml_logger
amllog = logging.getLogger("azureml")
amllog.level = logging.INFO
# Turn on cell level logging.
%azureml history on
%azureml history show
# Time the notebook execution.
# This will only make sense if you "Run all cells"
tic = time.time()
logger = get_azureml_logger() # logger writes to AMLWorkbench runtime view
spark = SparkSession.builder.getOrCreate()
# Telemetry
logger.log('amlrealworld.predictivemaintenance.feature_engineering','true')
Out[1]:
In the Data Ingestion notebook (Code/1_data_ingestion.ipynb), we downloaded, converted, and stored the machines, maintenance, errors, telemetry, and failures data sets in Azure blob storage.
We first load these files. Since the Azure Blob storage account name and account key are not passed between notebooks, you'll need to provide them here again.
In [2]:
# Enter your Azure blob storage details here
ACCOUNT_NAME = "<your blob storage account name>"
# You can find the account key under the _Access Keys_ link in the
# [Azure Portal](portal.azure.com) page for your Azure storage container.
ACCOUNT_KEY = "<your blob storage account key>"
#-------------------------------------------------------------------------------------------
# The data from the Data Ingestion notebook is stored in the dataingestion container.
CONTAINER_NAME = "dataingestion"
# The data constructed in this notebook will be stored in the featureengineering container
STORAGE_CONTAINER_NAME = "featureengineering"
# Connect to your blob service
az_blob_service = BlockBlobService(account_name=ACCOUNT_NAME, account_key=ACCOUNT_KEY)
# We will store each of these data sets in blob storage in an
# Azure Storage Container on your Azure subscription.
# See https://github.com/Azure/ViennaDocs/blob/master/Documentation/UsingBlobForStorage.md
# for details.
# These file names detail which blob each file is stored under.
MACH_DATA = 'machines_files.parquet'
MAINT_DATA = 'maint_files.parquet'
ERROR_DATA = 'errors_files.parquet'
TELEMETRY_DATA = 'telemetry_files.parquet'
FAILURE_DATA = 'failure_files.parquet'
# These file names detail the local paths where we store the data results.
MACH_LOCAL_DIRECT = 'dataingestion_mach_result.parquet'
ERROR_LOCAL_DIRECT = 'dataingestion_err_result.parquet'
MAINT_LOCAL_DIRECT = 'dataingestion_maint_result.parquet'
TELEMETRY_LOCAL_DIRECT = 'dataingestion_tel_result.parquet'
FAILURES_LOCAL_DIRECT = 'dataingestion_fail_result.parquet'
# This is the final data file.
FEATURES_LOCAL_DIRECT = 'featureengineering_files.parquet'
In [3]:
# create a local path to store the data.
if not os.path.exists(MACH_LOCAL_DIRECT):
    os.makedirs(MACH_LOCAL_DIRECT)
    print('DONE creating a local directory!')
# Connect to blob storage container
for blob in az_blob_service.list_blobs(CONTAINER_NAME):
    if MACH_DATA in blob.name:
        local_file = os.path.join(MACH_LOCAL_DIRECT, os.path.basename(blob.name))
        az_blob_service.get_blob_to_path(CONTAINER_NAME, blob.name, local_file)
# Read in the data
machines = spark.read.parquet(MACH_LOCAL_DIRECT)
print(machines.count())
machines.limit(5).toPandas().head()
Out[3]:
In [4]:
if not os.path.exists(ERROR_LOCAL_DIRECT):
    os.makedirs(ERROR_LOCAL_DIRECT)
    print('DONE creating a local directory!')
# Connect to blob storage container
for blob in az_blob_service.list_blobs(CONTAINER_NAME):
    if ERROR_DATA in blob.name:
        local_file = os.path.join(ERROR_LOCAL_DIRECT, os.path.basename(blob.name))
        az_blob_service.get_blob_to_path(CONTAINER_NAME, blob.name, local_file)
# Read in the data
errors = spark.read.parquet(ERROR_LOCAL_DIRECT)
print(errors.count())
errors.printSchema()
errors.limit(5).toPandas().head()
Out[4]:
In [5]:
# create a local path to store the data.
if not os.path.exists(MAINT_LOCAL_DIRECT):
    os.makedirs(MAINT_LOCAL_DIRECT)
    print('DONE creating a local directory!')
# Connect to blob storage container
for blob in az_blob_service.list_blobs(CONTAINER_NAME):
    if MAINT_DATA in blob.name:
        local_file = os.path.join(MAINT_LOCAL_DIRECT, os.path.basename(blob.name))
        az_blob_service.get_blob_to_path(CONTAINER_NAME, blob.name, local_file)
# Read in the data
maint = spark.read.parquet(MAINT_LOCAL_DIRECT)
print(maint.count())
maint.limit(5).toPandas().head()
Out[5]:
In [6]:
# create a local path to store the data.
if not os.path.exists(TELEMETRY_LOCAL_DIRECT):
    os.makedirs(TELEMETRY_LOCAL_DIRECT)
    print('DONE creating a local directory!')
# Connect to blob storage container
for blob in az_blob_service.list_blobs(CONTAINER_NAME):
    if TELEMETRY_DATA in blob.name:
        local_file = os.path.join(TELEMETRY_LOCAL_DIRECT, os.path.basename(blob.name))
        az_blob_service.get_blob_to_path(CONTAINER_NAME, blob.name, local_file)
# Read in the data
telemetry = spark.read.parquet(TELEMETRY_LOCAL_DIRECT)
print(telemetry.count())
telemetry.limit(5).toPandas().head()
Out[6]:
In [7]:
# create a local path to store the data.
if not os.path.exists(FAILURES_LOCAL_DIRECT):
    os.makedirs(FAILURES_LOCAL_DIRECT)
    print('DONE creating a local directory!')
# download the entire parquet result folder to local path for a new run
for blob in az_blob_service.list_blobs(CONTAINER_NAME):
    if FAILURE_DATA in blob.name:
        local_file = os.path.join(FAILURES_LOCAL_DIRECT, os.path.basename(blob.name))
        az_blob_service.get_blob_to_path(CONTAINER_NAME, blob.name, local_file)
failures = spark.read.parquet(FAILURES_LOCAL_DIRECT).dropDuplicates(['machineID', 'datetime'])
print(failures.count())
failures.limit(5).toPandas().head()
Out[7]:
Our feature engineering will combine the different data sources together to create a single data set of features (variables) that can be used to infer a machine's health condition over time. The ultimate goal is to generate a single record for each time unit within each asset. The record combines features and labels to be fed into the machine learning algorithm.
Predictive maintenance takes historical data, marked with a timestamp, to predict the current health of a component and the probability of failure within some future window of time. These problems can be characterized as classification problems involving time series data: time series, since we use historical observations to predict what will happen in the future; classification, because we classify the future as having a probability of failure.
There are many ways of creating features from time series data. We start by dividing the duration of data collection into time units, where each record belongs to a single point in time for each asset. The measurement unit is in fact arbitrary: time can be in seconds, minutes, hours, days, or months, or it can be measured in cycles, miles, or transactions. The measurement choice is typically specific to the use case domain.
Additionally, the time unit does not have to be the same as the frequency of data collection. For example, if temperature values were being collected every 10 seconds, picking a time unit of 10 seconds for analysis may inflate the number of examples without providing any additional information if the temperature changes slowly. A better strategy may be to average the temperature over a longer time horizon which might better capture variations that contribute to the target outcome.
Once we set the frequency of observations, we want to look for trends within measures over time in order to predict performance degradation, which we would like to connect to how likely a component is to fail. We create features for these trends within each record using time lags over previous observations to check for these performance changes. The lag window size $W$ is a hyperparameter that we can optimize. The following figures indicate a rolling aggregate window strategy for averaging a measure $t_i$ over a window of $W = 3$ previous observations.
We are not constrained to averages: we can compute rolling aggregates over counts, means, standard deviations, outliers based on standard deviations, CUSUM measures, or minimum and maximum values for the window.
We could also use a tumbling window approach if we were interested in a different time window measure than the frequency of the observations. For example, we might have observations every 6 or 12 hours but want to create features aligned on a day or week basis.
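The difference between the two strategies can be illustrated on a toy example. The following is a minimal sketch (not part of the original pipeline; it reuses the SparkSession and imports created above and invents a small toy data frame) contrasting a rolling 3-observation mean with a tumbling 3-hour mean:
# Sketch: rolling vs. tumbling aggregation on a hypothetical toy data frame.
toy = spark.createDataFrame(
    [(1, '2015-01-01 0{}:00:00'.format(h), 170.0 + h) for h in range(6)],
    ['machineID', 'datetime', 'volt']).withColumn('datetime', col('datetime').cast('timestamp'))
# Rolling: each record is averaged with its 2 predecessors (window W = 3).
w = Window.partitionBy('machineID').orderBy('datetime').rowsBetween(-2, 0)
rolling = toy.withColumn('volt_rollingmean_3', F.avg('volt').over(w))
# Tumbling: records are snapped to fixed 3-hour buckets and averaged per bucket.
bucket = (round(unix_timestamp(col('datetime')) / (3 * 3600)) * (3 * 3600)).cast('timestamp')
tumbling = (toy.withColumn('dt_truncated', bucket)
            .groupBy('machineID', 'dt_truncated')
            .agg(F.avg('volt').alias('volt_mean_3hr')))
The rolling result keeps one row per observation, while the tumbling result collapses the observations that fall in the same bucket into a single row.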
In the following sections, we build our features using only a rolling strategy to demonstrate the process. We align our data, and then build features along those normalized observation times. We start with the telemetry data.
Because the telemetry data set is the largest time series data we have, we start feature engineering there. The telemetry data has 8,761,000 hourly observations for our 1,000 machines. We can improve model performance by aligning the data, aggregating the average sensor measures on a tumbling 12-hour window. In this case we replace the raw data with the tumbling window data, reducing the sensor data to 731,000 observations. This directly reduces the computation time required for the feature engineering, labeling, and modeling in our solution.
Once we have the reduced data, we set up our lag features by computing rolling aggregate measures such as the mean, standard deviation, minimum, and maximum to represent the short-term history of the telemetry over time.
The following code blocks align the data on 12-hour observations and calculate the rolling mean and standard deviation of the telemetry data over the last 12, 24, and 36-hour lags.
In [8]:
# rolling mean and standard deviation
# Temporary storage for rolling means
tel_mean = telemetry
# Which features are we interested in telemetry data set
rolling_features = ['volt','rotate', 'pressure', 'vibration']
# n hours = n * 3600 seconds
time_val = 12 * 3600
# Choose the time_val hour timestamps to align the data
# dt_truncated looks at the column named "datetime" in the current data set.
# remember that Spark is lazy... this doesn't execute until it is in a withColumn statement.
dt_truncated = ((round(unix_timestamp(col("datetime")) / time_val) * time_val).cast("timestamp"))
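As a quick sanity check (a sketch, not part of the pipeline), applying the same expression to a couple of literal timestamps shows how they are snapped to 12-hour boundaries; with a UTC session timezone these boundaries are 00:00 and 12:00:
# Hypothetical two-row data frame to illustrate the dt_truncated expression.
sample = (spark.createDataFrame([('2015-01-01 05:00:00',), ('2015-01-01 07:00:00',)], ['datetime'])
          .withColumn('datetime', col('datetime').cast('timestamp')))
# With a UTC session timezone, 05:00 rounds down to 00:00 and 07:00 rounds up to 12:00.
sample.withColumn('dt_truncated', dt_truncated).show(truncate=False)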
In [9]:
# We choose windows for our rolling windows 12hrs, 24 hrs and 36 hrs
lags = [12, 24, 36]
# align the data
for lag_n in lags:
    wSpec = Window.partitionBy('machineID').orderBy('datetime').rowsBetween(1-lag_n, 0)
    for col_name in rolling_features:
        tel_mean = tel_mean.withColumn(col_name+'_rollingmean_'+str(lag_n),
                                       F.avg(col(col_name)).over(wSpec))
        tel_mean = tel_mean.withColumn(col_name+'_rollingstd_'+str(lag_n),
                                       F.stddev(col(col_name)).over(wSpec))
# Calculate lag values...
telemetry_feat = (tel_mean.withColumn("dt_truncated", dt_truncated)
.drop('volt', 'rotate', 'pressure', 'vibration')
.fillna(0)
.groupBy("machineID","dt_truncated")
.agg(F.mean('volt_rollingmean_12').alias('volt_rollingmean_12'),
F.mean('rotate_rollingmean_12').alias('rotate_rollingmean_12'),
F.mean('pressure_rollingmean_12').alias('pressure_rollingmean_12'),
F.mean('vibration_rollingmean_12').alias('vibration_rollingmean_12'),
F.mean('volt_rollingmean_24').alias('volt_rollingmean_24'),
F.mean('rotate_rollingmean_24').alias('rotate_rollingmean_24'),
F.mean('pressure_rollingmean_24').alias('pressure_rollingmean_24'),
F.mean('vibration_rollingmean_24').alias('vibration_rollingmean_24'),
F.mean('volt_rollingmean_36').alias('volt_rollingmean_36'),
F.mean('vibration_rollingmean_36').alias('vibration_rollingmean_36'),
F.mean('rotate_rollingmean_36').alias('rotate_rollingmean_36'),
F.mean('pressure_rollingmean_36').alias('pressure_rollingmean_36'),
F.stddev('volt_rollingstd_12').alias('volt_rollingstd_12'),
F.stddev('rotate_rollingstd_12').alias('rotate_rollingstd_12'),
F.stddev('pressure_rollingstd_12').alias('pressure_rollingstd_12'),
F.stddev('vibration_rollingstd_12').alias('vibration_rollingstd_12'),
F.stddev('volt_rollingstd_24').alias('volt_rollingstd_24'),
F.stddev('rotate_rollingstd_24').alias('rotate_rollingstd_24'),
F.stddev('pressure_rollingstd_24').alias('pressure_rollingstd_24'),
F.stddev('vibration_rollingstd_24').alias('vibration_rollingstd_24'),
F.stddev('volt_rollingstd_36').alias('volt_rollingstd_36'),
F.stddev('rotate_rollingstd_36').alias('rotate_rollingstd_36'),
F.stddev('pressure_rollingstd_36').alias('pressure_rollingstd_36'),
F.stddev('vibration_rollingstd_36').alias('vibration_rollingstd_36'), ))
print(telemetry_feat.count())
telemetry_feat.where((col("machineID") == 1)).limit(10).toPandas().head(10)
Out[9]:
Like telemetry data, errors come with timestamps. An important difference is that the error IDs are categorical values and should not be averaged over time intervals like the telemetry measurements. Instead, we count the number of errors of each type within a lag window.
Again, we align the error count data by tumbling it over the 12-hour window, using a join with the telemetry data.
In [10]:
# create a column for each errorID
error_ind = (errors.groupBy("machineID","datetime","errorID").pivot('errorID')
.agg(F.count('machineID').alias('dummy')).drop('errorID').fillna(0)
.groupBy("machineID","datetime")
.agg(F.sum('error1').alias('error1sum'),
F.sum('error2').alias('error2sum'),
F.sum('error3').alias('error3sum'),
F.sum('error4').alias('error4sum'),
F.sum('error5').alias('error5sum')))
# join the telemetry data with errors
error_count = (telemetry.join(error_ind,
((telemetry['machineID'] == error_ind['machineID'])
& (telemetry['datetime'] == error_ind['datetime'])), "left")
.drop('volt', 'rotate', 'pressure', 'vibration')
.drop(error_ind.machineID).drop(error_ind.datetime)
.fillna(0))
error_features = ['error1sum','error2sum', 'error3sum', 'error4sum', 'error5sum']
wSpec = Window.partitionBy('machineID').orderBy('datetime').rowsBetween(1-24, 0)
for col_name in error_features:
    # We're only interested in the errors in the previous 24 hours.
    error_count = error_count.withColumn(col_name+'_rollingmean_24',
                                         F.avg(col(col_name)).over(wSpec))
error_feat = (error_count.withColumn("dt_truncated", dt_truncated)
.drop('error1sum', 'error2sum', 'error3sum', 'error4sum', 'error5sum').fillna(0)
.groupBy("machineID","dt_truncated")
.agg(F.mean('error1sum_rollingmean_24').alias('error1sum_rollingmean_24'),
F.mean('error2sum_rollingmean_24').alias('error2sum_rollingmean_24'),
F.mean('error3sum_rollingmean_24').alias('error3sum_rollingmean_24'),
F.mean('error4sum_rollingmean_24').alias('error4sum_rollingmean_24'),
F.mean('error5sum_rollingmean_24').alias('error5sum_rollingmean_24')))
print(error_feat.count())
error_feat.limit(10).toPandas().head(10)
Out[10]:
A crucial data set in this example is the maintenance records, which contain information regarding component replacements. Possible features from this data set include the number of replacements of each component over time, or how long it has been since a component was last replaced. Time since replacement is expected to correlate well with component failures, since the longer a component is in use, the more degradation is expected.
As a side note, creating lag features from maintenance data is not straightforward. This type of ad-hoc feature engineering is very common in predictive maintenance, as domain knowledge plays a crucial role in understanding the predictors of a failure problem. In the following code blocks, the days since the last component replacement are calculated for each component from the maintenance data. We start by counting the component replacements for the set of machines.
In [11]:
# create a column for each component replacement
maint_replace = (maint.groupBy("machineID","datetime","comp").pivot('comp')
.agg(F.count('machineID').alias('dummy')).fillna(0)
.groupBy("machineID","datetime")
.agg(F.sum('comp1').alias('comp1sum'),
F.sum('comp2').alias('comp2sum'),
F.sum('comp3').alias('comp3sum'),
F.sum('comp4').alias('comp4sum')))
maint_replace = maint_replace.withColumnRenamed('datetime','datetime_maint')
print(maint_replace.count())
maint_replace.limit(10).toPandas().head(10)
Out[11]:
Replacement features are then created by tracking the number of days between each component replacement. We'll repeat these calculations for each of the four components and join them together into a maintenance feature table.
First, component 1 (comp1):
In [12]:
# We want to align the component information on telemetry features timestamps.
telemetry_times = (telemetry_feat.select(telemetry_feat.machineID, telemetry_feat.dt_truncated)
.withColumnRenamed('dt_truncated','datetime_tel'))
# Grab component 1 records
maint_comp1 = (maint_replace.where(col("comp1sum") == '1').withColumnRenamed('datetime','datetime_maint')
.drop('comp2sum', 'comp3sum', 'comp4sum'))
# Within each machine, get the last replacement date for each timepoint
maint_tel_comp1 = (telemetry_times.join(maint_comp1,
((telemetry_times ['machineID']== maint_comp1['machineID'])
& (telemetry_times ['datetime_tel'] > maint_comp1['datetime_maint'])
& ( maint_comp1['comp1sum'] == '1')))
.drop(maint_comp1.machineID))
# Calculate the number of days between replacements
comp1 = (maint_tel_comp1.withColumn("sincelastcomp1",
datediff(maint_tel_comp1.datetime_tel, maint_tel_comp1.datetime_maint))
.drop(maint_tel_comp1.datetime_maint).drop(maint_tel_comp1.comp1sum))
print(comp1.count())
comp1.filter(comp1.machineID == '625').orderBy(comp1.datetime_tel).limit(20).toPandas().head(20)
Out[12]:
Then component 2 (comp2):
In [13]:
# Grab component 2 records
maint_comp2 = (maint_replace.where(col("comp2sum") == '1').withColumnRenamed('datetime','datetime_maint')
.drop('comp1sum', 'comp3sum', 'comp4sum'))
# Within each machine, get the last replacement date for each timepoint
maint_tel_comp2 = (telemetry_times.join(maint_comp2,
((telemetry_times ['machineID']== maint_comp2['machineID'])
& (telemetry_times ['datetime_tel'] > maint_comp2['datetime_maint'])
& ( maint_comp2['comp2sum'] == '1')))
.drop(maint_comp2.machineID))
# Calculate the number of days between replacements
comp2 = (maint_tel_comp2.withColumn("sincelastcomp2",
datediff(maint_tel_comp2.datetime_tel, maint_tel_comp2.datetime_maint))
.drop(maint_tel_comp2.datetime_maint).drop(maint_tel_comp2.comp2sum))
print(comp2.count())
comp2.filter(comp2.machineID == '625').orderBy(comp2.datetime_tel).limit(5).toPandas().head(5)
Out[13]:
Then component 3 (comp3):
In [14]:
# Grab component 3 records
maint_comp3 = (maint_replace.where(col("comp3sum") == '1').withColumnRenamed('datetime','datetime_maint')
.drop('comp1sum', 'comp2sum', 'comp4sum'))
# Within each machine, get the last replacement date for each timepoint
maint_tel_comp3 = (telemetry_times.join(maint_comp3, ((telemetry_times ['machineID']==maint_comp3['machineID'])
& (telemetry_times ['datetime_tel'] > maint_comp3['datetime_maint'])
& ( maint_comp3['comp3sum'] == '1')))
.drop(maint_comp3.machineID))
# Calculate the number of days between replacements
comp3 = (maint_tel_comp3.withColumn("sincelastcomp3",
datediff(maint_tel_comp3.datetime_tel, maint_tel_comp3.datetime_maint))
.drop(maint_tel_comp3.datetime_maint).drop(maint_tel_comp3.comp3sum))
print(comp3.count())
comp3.filter(comp3.machineID == '625').orderBy(comp3.datetime_tel).limit(5).toPandas().head(5)
Out[14]:
And component 4 (comp4):
In [15]:
# Grab component 4 records
maint_comp4 = (maint_replace.where(col("comp4sum") == '1').withColumnRenamed('datetime','datetime_maint')
.drop('comp1sum', 'comp2sum', 'comp3sum'))
# Within each machine, get the last replacement date for each timepoint
maint_tel_comp4 = telemetry_times.join(maint_comp4, ((telemetry_times['machineID']==maint_comp4['machineID'])
& (telemetry_times['datetime_tel'] > maint_comp4['datetime_maint'])
& (maint_comp4['comp4sum'] == '1'))).drop(maint_comp4.machineID)
# Calculate the number of days between replacements
comp4 = (maint_tel_comp4.withColumn("sincelastcomp4",
datediff(maint_tel_comp4.datetime_tel, maint_tel_comp4.datetime_maint))
.drop(maint_tel_comp4.datetime_maint).drop(maint_tel_comp4.comp4sum))
print(comp4.count())
comp4.filter(comp4.machineID == '625').orderBy(comp4.datetime_tel).limit(5).toPandas().head(5)
Out[15]:
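As a side note, the four nearly identical cells above could be generated in a loop. The following is a rough sketch (assuming the same maint_replace and telemetry_times data frames; it is not the original notebook's layout) that builds the "days since last replacement" table for each component in one pass:
# Sketch: build the comp1..comp4 "days since last replacement" tables in a loop.
since_last = {}
for i in range(1, 5):
    comp_col = 'comp{}sum'.format(i)
    others = ['comp{}sum'.format(j) for j in range(1, 5) if j != i]
    maint_c = maint_replace.where(col(comp_col) == 1).drop(*others)
    joined = (telemetry_times.join(maint_c,
                                   (telemetry_times['machineID'] == maint_c['machineID']) &
                                   (telemetry_times['datetime_tel'] > maint_c['datetime_maint']))
              .drop(maint_c.machineID))
    since_last[i] = (joined.withColumn('sincelastcomp{}'.format(i),
                                       datediff(col('datetime_tel'), col('datetime_maint')))
                     .drop('datetime_maint').drop(comp_col))
Here since_last[1] corresponds to the comp1 data frame above, and so on for the other components.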
Now we join the four component replacement tables together. Once joined, we align the data by tumbling the average across 12-hour observation windows.
In [16]:
# Join component 3 and 4
comp3_4 = (comp3.join(comp4, ((comp3['machineID'] == comp4['machineID'])
& (comp3['datetime_tel'] == comp4['datetime_tel'])), "left")
.drop(comp4.machineID).drop(comp4.datetime_tel))
# Join component 2 to 3 and 4
comp2_3_4 = (comp2.join(comp3_4, ((comp2['machineID'] == comp3_4['machineID'])
& (comp2['datetime_tel'] == comp3_4['datetime_tel'])), "left")
.drop(comp3_4.machineID).drop(comp3_4.datetime_tel))
# Join component 1 to 2, 3 and 4
comps_feat = (comp1.join(comp2_3_4, ((comp1['machineID'] == comp2_3_4['machineID'])
& (comp1['datetime_tel'] == comp2_3_4['datetime_tel'])), "left")
.drop(comp2_3_4.machineID).drop(comp2_3_4.datetime_tel)
.groupBy("machineID", "datetime_tel")
.agg(F.max('sincelastcomp1').alias('sincelastcomp1'),
F.max('sincelastcomp2').alias('sincelastcomp2'),
F.max('sincelastcomp3').alias('sincelastcomp3'),
F.max('sincelastcomp4').alias('sincelastcomp4'))
.fillna(0))
# Choose the time_val hour timestamps to align the data
dt_truncated = ((round(unix_timestamp(col("datetime_tel")) / time_val) * time_val).cast("timestamp"))
# Collect data
maint_feat = (comps_feat.withColumn("dt_truncated", dt_truncated)
.groupBy("machineID","dt_truncated")
.agg(F.mean('sincelastcomp1').alias('comp1sum'),
F.mean('sincelastcomp2').alias('comp2sum'),
F.mean('sincelastcomp3').alias('comp3sum'),
F.mean('sincelastcomp4').alias('comp4sum')))
print(maint_feat.count())
maint_feat.limit(10).toPandas().head(10)
Out[16]:
The machine features capture specifics of the individual machines. These can be used without further modification since they include descriptive information about the type of each machine and its age (number of years in service). If the age information had been recorded as a "first use date" for each machine, a transformation would have been necessary to turn it into numeric values indicating the years in service.
We do need to create a set of dummy features, a set of Boolean variables, to indicate the model of the machine. This can be done either manually or with a one-hot encoding step. We use one-hot encoding for demonstration purposes.
In [17]:
# one hot encoding of the variable model, basically creates a set of dummy boolean variables
catVarNames = ['model']
sIndexers = [StringIndexer(inputCol=x, outputCol=x + '_indexed') for x in catVarNames]
machines_cat = Pipeline(stages=sIndexers).fit(machines).transform(machines)
# one-hot encode
ohEncoders = [OneHotEncoder(inputCol=x + '_indexed', outputCol=x + '_encoded')
for x in catVarNames]
ohPipelineModel = Pipeline(stages=ohEncoders).fit(machines_cat)
machines_cat = ohPipelineModel.transform(machines_cat)
drop_list = [col_n for col_n in machines_cat.columns if 'indexed' in col_n]
machines_feat = machines_cat.select([column for column in machines_cat.columns if column not in drop_list])
print(machines_feat.count())
machines_feat.limit(10).toPandas().head(10)
Out[17]:
In [18]:
# join error features with component maintenance features
error_maint = (error_feat.join(maint_feat,
((error_feat['machineID'] == maint_feat['machineID'])
& (error_feat['dt_truncated'] == maint_feat['dt_truncated'])), "left")
.drop(maint_feat.machineID).drop(maint_feat.dt_truncated))
# now join that with machines features
error_maint_feat = (error_maint.join(machines_feat,
((error_maint['machineID'] == machines_feat['machineID'])), "left")
.drop(machines_feat.machineID))
# Clean up some unnecessary columns
error_maint_feat = error_maint_feat.select([c for c in error_maint_feat.columns if c not in
{'error1sum', 'error2sum', 'error3sum', 'error4sum', 'error5sum'}])
# join telemetry with error/maint/machine features to create final feature matrix
final_feat = (telemetry_feat.join(error_maint_feat,
((telemetry_feat['machineID'] == error_maint_feat['machineID'])
& (telemetry_feat['dt_truncated'] == error_maint_feat['dt_truncated'])), "left")
.drop(error_maint_feat.machineID).drop(error_maint_feat.dt_truncated))
print(final_feat.count())
final_feat.filter(final_feat.machineID == '625').orderBy(final_feat.dt_truncated).limit(10).toPandas().head(10)
Out[18]:
Predictive maintenance is supervised learning. Training a model to predict failures requires both examples of failures and the time series of observations leading up to those failures. Additionally, the model needs examples of periods of healthy operation in order to discern the difference between the two states. The classification between these states is typically a Boolean label (healthy vs. failed).
Once we have the healthy vs. failed states, the predictive maintenance approach is only useful if the method gives some advance warning of an impending failure. To meet this early-warning criterion, we slightly modify the label definition from a failure event that occurs at a specific moment in time to a longer window in which the failure event occurs. The window length is defined by the business criteria: is knowing that a failure will occur within 12 hours enough time to prevent it from happening? Is 24 hours, or 2 weeks? The ability of the model to accurately predict an impending failure depends on the sizing of this window. If the failure signal is short, longer windows will not help and can actually degrade the potential performance.
To achieve this redefinition from failed to about-to-fail, we over-label failure events, marking all observations within the failure warning window as failed. The prediction problem then becomes estimating the probability of failure within this window.
For this example scenario, we estimate the probability that a machine will fail in the near future due to the failure of a certain component. More specifically, the goal is to compute the probability that a machine will fail in the next 7 days due to a component failure (component 1, 2, 3, or 4).
Below, a categorical failure feature is created to serve as the label. All records within a 24-hour window before a failure of component 1 have failure="comp1", and so on for components 2, 3, and 4; all records not within 7 days of a component failure have failure="none".
The first step is to align the failure data to the feature observation time points (every 12 hours).
In [19]:
dt_truncated = ((round(unix_timestamp(col("datetime")) / time_val) * time_val).cast("timestamp"))
fail_diff = (failures.withColumn("dt_truncated", dt_truncated)
.drop(failures.datetime))
print(fail_diff.count())
fail_diff.limit(10).toPandas().head(10)
Out[19]:
Next, we convert the labels from text to numeric values. In the end, this transforms the problem from a Boolean 'healthy'/'impending failure' classification into a multiclass 'healthy'/'component n impending failure' classification.
In [20]:
# map the failure data to final feature matrix
labeled_features = (final_feat.join(fail_diff,
((final_feat['machineID'] == fail_diff['machineID'])
& (final_feat['dt_truncated'] == fail_diff['dt_truncated'])), "left")
.drop(fail_diff.machineID).drop(fail_diff.dt_truncated)
.withColumn('failure', F.when(col('failure') == "comp1", 1.0).otherwise(col('failure')))
.withColumn('failure', F.when(col('failure') == "comp2", 2.0).otherwise(col('failure')))
.withColumn('failure', F.when(col('failure') == "comp3", 3.0).otherwise(col('failure')))
.withColumn('failure', F.when(col('failure') == "comp4", 4.0).otherwise(col('failure'))))
labeled_features = (labeled_features.withColumn("failure",
labeled_features.failure.cast(DoubleType()))
.fillna(0))
print(labeled_features.count())
labeled_features.limit(10).toPandas().head(10)
Out[20]:
To verify we have assigned the component failure records correctly, we count the failure classes within the feature data.
In [21]:
# To get the frequency of each component failure
df = labeled_features.select(labeled_features.failure).toPandas()
df['failure'].value_counts()
Out[21]:
Up to now, we have labeled only the failure events themselves. To convert these to impending-failure labels, we over-label the previous 7 days before each failure as failed.
In [22]:
# lag values to manually backfill label (bfill =7)
my_window = Window.partitionBy('machineID').orderBy(labeled_features.dt_truncated.desc())
# Create the previous 7 days
labeled_features = (labeled_features.withColumn("prev_value1",
F.lag(labeled_features.failure).
over(my_window)).fillna(0))
labeled_features = (labeled_features.withColumn("prev_value2",
F.lag(labeled_features.prev_value1).
over(my_window)).fillna(0))
labeled_features = (labeled_features.withColumn("prev_value3",
F.lag(labeled_features.prev_value2).
over(my_window)).fillna(0))
labeled_features = (labeled_features.withColumn("prev_value4",
F.lag(labeled_features.prev_value3).
over(my_window)).fillna(0))
labeled_features = (labeled_features.withColumn("prev_value5",
F.lag(labeled_features.prev_value4).
over(my_window)).fillna(0))
labeled_features = (labeled_features.withColumn("prev_value6",
F.lag(labeled_features.prev_value5).
over(my_window)).fillna(0))
labeled_features = (labeled_features.withColumn("prev_value7",
F.lag(labeled_features.prev_value6).
over(my_window)).fillna(0))
# Create the label by summing the current failure value and the 7 lagged values
labeled_features = (labeled_features.withColumn('label', labeled_features.failure +
labeled_features.prev_value1 +
labeled_features.prev_value2 +
labeled_features.prev_value3 +
labeled_features.prev_value4 +
labeled_features.prev_value5 +
labeled_features.prev_value6 +
labeled_features.prev_value7))
# Restrict the label to be on the range of 0:4, and remove extra columns
labeled_features = (labeled_features.withColumn('label_e', F.when(col('label') > 4, 4.0)
.otherwise(col('label')))
.drop(labeled_features.prev_value1).drop(labeled_features.prev_value2)
.drop(labeled_features.prev_value3).drop(labeled_features.prev_value4)
.drop(labeled_features.prev_value5).drop(labeled_features.prev_value6)
.drop(labeled_features.prev_value7).drop(labeled_features.label))
print(labeled_features.count())
labeled_features.limit(10).toPandas().head(10)
Out[22]:
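As a side note, the chained F.lag calls above can be expressed more compactly with a single forward-looking window aggregate. The following is a rough sketch (not the notebook's original approach) that is equivalent only under the assumption that failure events on a machine are separated by more than 7 observations, so that at most one non-zero failure code falls in any window:
# Sketch: take the max failure code over the current observation and the next 7
# (12-hour) observations. Assumes failures on a machine are more than 7
# observations apart, so the sum-then-cap approach above and this max agree.
fwd_window = Window.partitionBy('machineID').orderBy('dt_truncated').rowsBetween(0, 7)
labeled_alt = labeled_features.withColumn('label_e_alt', F.max('failure').over(fwd_window))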
To verify the label construction, we plot a sample of four machines over the data set lifetime. We expect the labels to cluster for each component, since there are 7-day windows of "fail". We have omitted the healthy labels, as they are uninformative. Since the labels are actually classes, the plot shows four distinct values on the y-axis.
In [23]:
plt_dta = (labeled_features.filter(labeled_features.label_e > 0)
.where(col("machineID").isin({"65", "558", "222", "965"}))
.select(labeled_features.machineID, labeled_features.dt_truncated, labeled_features.label_e)
.toPandas())
# format datetime field which comes in as string
plt_dta['dt_truncated'] = pd.to_datetime(plt_dta['dt_truncated'], format="%Y-%m-%d %H:%M:%S")
plt_dta.label_e = plt_dta.label_e.astype(int)
ggplot(aes(x="dt_truncated", y="label_e", color="label_e"), plt_dta) +\
geom_point()+\
xlab("Date") + ylab("Component Number") +\
scale_x_date(labels=date_format('%m-%d')) +\
scale_color_brewer(type = 'seq', palette = 'BuGn') +\
facet_grid('machineID')
Out[23]:
Here we see that most of the days are marked as healthy (label = 0 records are omitted for plot performance, though the dates are still accurate). Each of the four machines has multiple failures over the course of the data set. Each labeled failure includes the date of failure and the previous seven days, all marked with the number indicating the component that failed.
The goal of the model will be to simultaneously predict when a failure will occur and which component will fail. This will be a multiclass classification problem, though we could pivot the data to individually predict the binary failure of a component instead of a machine, as sketched below.
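A rough sketch of such a pivot (using the labeled_features data frame above; this is not part of the original solution) would derive a per-component binary label from label_e, here for component 1:
# Sketch: binary "component 1 will fail within the labeling window" target,
# derived from the multiclass label_e (1.0 encodes component 1 in this notebook).
binary_comp1 = labeled_features.withColumn('comp1_label',
                                           F.when(col('label_e') == 1.0, 1.0).otherwise(0.0))
The same modeling pipeline could then be trained once per component on these binary targets.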
We write the final labeled feature data as a Parquet file to an Azure blob storage container. For technical details, see: https://github.com/Azure/ViennaDocs/blob/master/Documentation/UsingBlobForStorage.md
In [24]:
# Create a new container if necessary, otherwise you can use an existing container.
# This command creates the container if it does not already exist. Else it does nothing.
az_blob_service.create_container(STORAGE_CONTAINER_NAME,
fail_on_exist=False,
public_access=PublicAccess.Container)
# Write labeled feature data to blob for use in the next notebook
labeled_features.write.mode('overwrite').parquet(FEATURES_LOCAL_DIRECT)
# Delete the old data.
for blob in az_blob_service.list_blobs(STORAGE_CONTAINER_NAME):
    if FEATURES_LOCAL_DIRECT in blob.name:
        az_blob_service.delete_blob(STORAGE_CONTAINER_NAME, blob.name)
# upload the entire folder into blob storage
for name in glob.iglob(FEATURES_LOCAL_DIRECT + '/*'):
    print(os.path.abspath(name))
    az_blob_service.create_blob_from_path(STORAGE_CONTAINER_NAME, name, name)
print("Feature engineering final dataset files saved!")
# Time the notebook execution.
# This will only make sense if you "Run All" cells
toc = time.time()
print("Full run took %.2f minutes" % ((toc - tic)/60))
logger.log("Feature Engineering Run time", ((toc - tic)/60))
Out[24]: