Step 2: Feature Engineering

According to Wikipedia, feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, and is both difficult and expensive.

This feature engineering notebook loads the data sets created in the Data Ingestion notebook (Code/1_data_ingestion.ipynb) from an Azure storage container and combines them into a single data set of features (variables) that can be used to infer a machine's health condition over time. The notebook steps through several feature engineering and labeling methods to create this data set for use in our predictive maintenance machine learning solution.

Note: This notebook will take about 20-30 minutes to execute all cells, depending on the compute configuration you have set up.


In [1]:
## Set up our environment by importing required libraries
import time
import os
import glob

# For displaying query results as pandas data frames
import pandas as pd

# For creating some preliminary EDA plots.
%matplotlib inline
import matplotlib.pyplot as plt
from ggplot import *

import datetime
from pyspark.sql.functions import to_date

import pyspark.sql.functions as F
from pyspark.sql.functions import col, unix_timestamp, round
from pyspark.sql.functions import datediff
from pyspark.sql.window import Window
from pyspark.sql.types import DoubleType

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.feature import StringIndexer

from pyspark.sql import SparkSession

# For Azure blob storage access
from azure.storage.blob import BlockBlobService
from azure.storage.blob import PublicAccess

# For logging model evaluation parameters back into the
# AML Workbench run history plots.
import logging
from azureml.logging import get_azureml_logger

amllog = logging.getLogger("azureml")
amllog.level = logging.INFO

# Turn on cell level logging.
%azureml history on
%azureml history show

# Time the notebook execution. 
# This will only make sense if you "Run all cells"
tic = time.time()

logger = get_azureml_logger() # logger writes to AMLWorkbench runtime view

spark = SparkSession.builder.getOrCreate()

# Telemetry
logger.log('amlrealworld.predictivemaintenance.feature_engineering','true')


History logging enabled
History logging is enabled
Out[1]:
<azureml.logging.script_run_request.ScriptRunRequest at 0x7f1b8b9337b8>

Load raw data from Azure Blob storage container

In the Data Ingestion notebook (Code/1_data_ingestion.ipynb), we downloaded, converted and stored the following data sets into Azure blob storage:

  • Machines: Features differentiating each machine. For example, age and model.
  • Error: The log of non-critical errors. These errors may still indicate an impending component failure.
  • Maint: Machine maintenance history detailing component replacement or regular maintenance activities, with the date of replacement.
  • Telemetry: The operating conditions of a machine e.g. data collected from sensors.
  • Failure: The failure history of a machine or component within the machine.

We first load these files. Since the Azure Blob storage account name and account key are not passed between notebooks, you'll need to provide those here again.


In [2]:
# Enter your Azure blob storage details here 
ACCOUNT_NAME = "<your blob storage account name>"

# You can find the account key under the _Access Keys_ link in the 
# [Azure Portal](portal.azure.com) page for your Azure storage container.
ACCOUNT_KEY = "<your blob storage account key>"

#-------------------------------------------------------------------------------------------
# The data from the Data Ingestion notebook is stored in the dataingestion container.
CONTAINER_NAME = "dataingestion"

# The data constructed in this notebook will be stored in the featureengineering container
STORAGE_CONTAINER_NAME = "featureengineering"

# Connect to your blob service     
az_blob_service = BlockBlobService(account_name=ACCOUNT_NAME, account_key=ACCOUNT_KEY)

# We will store each of these data sets in blob storage in an 
# Azure Storage Container on your Azure subscription.
# See https://github.com/Azure/ViennaDocs/blob/master/Documentation/UsingBlobForStorage.md
# for details.

# These file names detail which blob each file is stored under. 
MACH_DATA = 'machines_files.parquet'
MAINT_DATA = 'maint_files.parquet'
ERROR_DATA = 'errors_files.parquet'
TELEMETRY_DATA = 'telemetry_files.parquet'
FAILURE_DATA = 'failure_files.parquet'

# These file names detail the local paths where we store the data results.
MACH_LOCAL_DIRECT = 'dataingestion_mach_result.parquet'
ERROR_LOCAL_DIRECT = 'dataingestion_err_result.parquet'
MAINT_LOCAL_DIRECT = 'dataingestion_maint_result.parquet'
TELEMETRY_LOCAL_DIRECT = 'dataingestion_tel_result.parquet'
FAILURES_LOCAL_DIRECT = 'dataingestion_fail_result.parquet'

# This is the final data file.
FEATURES_LOCAL_DIRECT = 'featureengineering_files.parquet'

Machines data set

Now, we load the machines data set from your Azure blob.


In [3]:
# create a local path  to store the data.
if not os.path.exists(MACH_LOCAL_DIRECT):
    os.makedirs(MACH_LOCAL_DIRECT)
    print('DONE creating a local directory!')

# Connect to blob storage container
for blob in az_blob_service.list_blobs(CONTAINER_NAME):
    if MACH_DATA in blob.name:
        local_file = os.path.join(MACH_LOCAL_DIRECT, os.path.basename(blob.name))
        az_blob_service.get_blob_to_path(CONTAINER_NAME, blob.name, local_file)

# Read in the data
machines = spark.read.parquet(MACH_LOCAL_DIRECT)

print(machines.count())
machines.limit(5).toPandas().head()


DONE creating a local directory!
1000
Out[3]:
machineID model age
0 1 model2 18
1 2 model4 7
2 3 model3 8
3 4 model3 7
4 5 model2 2

Errors data set

Load the errors data set from your Azure blob.


In [4]:
if not os.path.exists(ERROR_LOCAL_DIRECT):
    os.makedirs(ERROR_LOCAL_DIRECT)
    print('DONE creating a local directory!')

# Connect to blob storage container
for blob in az_blob_service.list_blobs(CONTAINER_NAME):
    if ERROR_DATA in blob.name:
        local_file = os.path.join(ERROR_LOCAL_DIRECT, os.path.basename(blob.name))
        az_blob_service.get_blob_to_path(CONTAINER_NAME, blob.name, local_file)

# Read in the data
errors = spark.read.parquet(ERROR_LOCAL_DIRECT)

print(errors.count())
errors.printSchema()
errors.limit(5).toPandas().head()


DONE creating a local directory!
11967
root
 |-- datetime: string (nullable = true)
 |-- machineID: long (nullable = true)
 |-- errorID: string (nullable = true)

Out[4]:
datetime machineID errorID
0 2015-10-04 06:00:00 163 error2
1 2015-10-04 06:00:00 163 error3
2 2015-10-06 18:00:00 163 error5
3 2015-11-03 06:00:00 163 error2
4 2015-11-03 06:00:00 163 error3

Maintenance data set

Load the maintenance data set from your Azure blob.


In [5]:
# create a local path  to store the data.
if not os.path.exists(MAINT_LOCAL_DIRECT):
    os.makedirs(MAINT_LOCAL_DIRECT)
    print('DONE creating a local directory!')

# Connect to blob storage container
for blob in az_blob_service.list_blobs(CONTAINER_NAME):
    if MAINT_DATA in blob.name:
        local_file = os.path.join(MAINT_LOCAL_DIRECT, os.path.basename(blob.name))
        az_blob_service.get_blob_to_path(CONTAINER_NAME, blob.name, local_file)

# Read in the data
maint = spark.read.parquet(MAINT_LOCAL_DIRECT)

print(maint.count())
maint.limit(5).toPandas().head()


DONE creating a local directory!
32592
Out[5]:
datetime machineID comp
0 2015-10-12 06:00:00 316 comp1
1 2015-10-27 06:00:00 316 comp2
2 2015-11-11 06:00:00 316 comp3
3 2015-11-26 06:00:00 316 comp2
4 2015-12-11 06:00:00 316 comp1

Telemetry

Load the telemetry data set from your Azure blob.


In [6]:
# create a local path  to store the data.
if not os.path.exists(TELEMETRY_LOCAL_DIRECT):
    os.makedirs(TELEMETRY_LOCAL_DIRECT)
    print('DONE creating a local directory!')

# Connect to blob storage container
for blob in az_blob_service.list_blobs(CONTAINER_NAME):
    if TELEMETRY_DATA in blob.name:
        local_file = os.path.join(TELEMETRY_LOCAL_DIRECT, os.path.basename(blob.name))
        az_blob_service.get_blob_to_path(CONTAINER_NAME, blob.name, local_file)

# Read in the data
telemetry = spark.read.parquet(TELEMETRY_LOCAL_DIRECT)

print(telemetry.count())
telemetry.limit(5).toPandas().head()


DONE creating a local directory!
8761000
Out[6]:
datetime machineID volt rotate pressure vibration
0 2015-05-07 17:00:00 334 179.166785 358.163062 98.472561 41.881622
1 2015-05-07 18:00:00 334 212.375587 352.844237 97.050602 38.222228
2 2015-05-07 19:00:00 334 153.863807 327.330970 89.030084 38.443395
3 2015-05-07 20:00:00 334 156.149709 382.273117 106.714794 42.841213
4 2015-05-07 21:00:00 334 179.393273 350.117291 106.001312 35.740815

Failures data set

Load the failures data set from your Azure blob.


In [7]:
# create a local path  to store the data.
if not os.path.exists(FAILURES_LOCAL_DIRECT):
    os.makedirs(FAILURES_LOCAL_DIRECT)
    print('DONE creating a local directory!')


# download the entire parquet result folder to local path for a new run 
for blob in az_blob_service.list_blobs(CONTAINER_NAME):
    if FAILURE_DATA in blob.name:
        local_file = os.path.join(FAILURES_LOCAL_DIRECT, os.path.basename(blob.name))
        az_blob_service.get_blob_to_path(CONTAINER_NAME, blob.name, local_file)

failures = spark.read.parquet(FAILURES_LOCAL_DIRECT).dropDuplicates(['machineID', 'datetime'])

print(failures.count())
failures.limit(5).toPandas().head()


DONE creating a local directory!
6368
Out[7]:
datetime machineID failure
0 2015-07-31 06:00:00 7 comp1
1 2015-02-17 06:00:00 179 comp4
2 2015-12-28 06:00:00 191 comp1
3 2015-06-13 06:00:00 221 comp2
4 2015-06-28 06:00:00 262 comp2

Feature engineering

Our feature engineering will combine the different data sources to create a single data set of features (variables) that can be used to infer a machine's health condition over time. The ultimate goal is to generate a single record for each time unit within each asset. The record combines features and labels to be fed into the machine learning algorithm.

Predictive maintenance takes historical data, marked with timestamps, to predict the current health of a component and the probability of failure within some future window of time. These problems can be characterized as classification problems involving time series data: time series, since we want to use historical observations to predict what will happen in the future; classification, because we classify the future as having a probability of failure.

Lag features

There are many ways of creating features from time series data. We start by dividing the duration of data collection into time units, where each record belongs to a single point in time for each asset. The measurement unit is, in fact, arbitrary. Time can be in seconds, minutes, hours, days, or months, or it can be measured in cycles, miles or transactions. The measurement choice is typically specific to the use case domain.

Additionally, the time unit does not have to be the same as the frequency of data collection. For example, if temperature values were being collected every 10 seconds, picking a time unit of 10 seconds for analysis may inflate the number of examples without providing any additional information if the temperature changes slowly. A better strategy may be to average the temperature over a longer time horizon which might better capture variations that contribute to the target outcome.
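
As a rough illustration of this point (hypothetical data, not part of this pipeline; the 10-second temperature series and the 10-minute window are assumptions for demonstration only), such a reduction could be sketched with pandas:

import numpy as np
import pandas as pd

# One day of hypothetical temperature readings sampled every 10 seconds.
temps = pd.Series(np.random.normal(70.0, 0.5, size=8640),
                  index=pd.date_range('2015-01-01', periods=8640, freq='10S'))

# Averaging over 10-minute windows keeps the slow-moving signal while
# reducing 8,640 raw examples to 144 aggregated observations.
temps_10min = temps.resample('10T').mean()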

Once we set the frequency of observations, we want to look for trends within measures, over time, in order to predict performance degradation, which we would like to connect to how likely a component will fail. We create features for these trends within each record using time lags over previous observations to check for these performance changes. The lag window size $W$ is a hyperparameter that we can optimize. For example, a rolling aggregate window strategy averages a measure $t_i$ over a window of the $W = 3$ previous observations, as sketched in the code below.

We are not constrained to averages; we can compute rolling aggregates over counts, means, standard deviations, outliers based on standard deviations, CUSUM measures, or minimum and maximum values for the window.

We could also use a tumbling window approach if we were interested in a time window different from the frequency of the observations. For example, we might have observations every 6 or 12 hours, but want to create features aligned on a daily or weekly basis.
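
As a rough sketch of the two strategies (using the telemetry data frame loaded above, with its machineID, datetime and volt columns; the window sizes here are illustrative only, not the ones used in this solution):

# Rolling window: average volt over the current and the 2 previous hourly
# observations (W = 3) for each machine.
rolling_w = Window.partitionBy('machineID').orderBy('datetime').rowsBetween(-2, 0)
volt_rolling = telemetry.withColumn('volt_rollingmean_3',
                                    F.avg(col('volt')).over(rolling_w))

# Tumbling window: truncate timestamps to 12-hour buckets, then average volt
# within each bucket, producing one record per machine per 12-hour period.
bucket = (round(unix_timestamp(col('datetime')) / (12 * 3600)) * (12 * 3600)).cast('timestamp')
volt_tumbling = (telemetry.withColumn('dt_bucket', bucket)
                 .groupBy('machineID', 'dt_bucket')
                 .agg(F.avg('volt').alias('volt_12hr_mean')))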

In the following sections, we will build our features using only a rolling strategy to demonstrate the process. We align our data, and then build features along those normalized observation times. We start with the telemetry data.

Telemetry features

Because the telemetry data set is the largest time series data we have, we start feature engineering here. The telemetry data has 8,761,000 hourly observations for our 1,000 machines. We can improve model performance by aligning the data, aggregating the average sensor measures over a tumbling 12-hour window. In this case we replace the raw data with the tumbling window data, reducing the sensor data to 731,000 observations (731 twelve-hour buckets per machine across the 1,000 machines). This directly reduces the computation time required for the feature engineering, labeling and modeling steps of our solution.

Once we have the reduced data, we set up our lag features by computing rolling aggregate measures such as the mean, standard deviation, minimum and maximum to represent the short-term history of the telemetry over time.

The following code blocks align the data on 12-hour observations and calculate the rolling mean and standard deviation of the telemetry data over the last 12, 24 and 36 hour lags.


In [8]:
# rolling mean and standard deviation
# Temporary storage for rolling means
tel_mean = telemetry

# Which features are we interested in telemetry data set
rolling_features = ['volt','rotate', 'pressure', 'vibration']
      
# n hours = n * 3600 seconds  
time_val = 12 * 3600

# Choose the time_val hour timestamps to align the data
# dt_truncated looks at the column named "datetime" in the current data set.
# remember that Spark is lazy... this doesn't execute until it is in a withColumn statement.
dt_truncated = ((round(unix_timestamp(col("datetime")) / time_val) * time_val).cast("timestamp"))

In [9]:
# Rolling lag window sizes: 12, 24 and 36 hours of previous observations
lags = [12, 24, 36]

# Compute the rolling mean and standard deviation over each lag window
for lag_n in lags:
    wSpec = Window.partitionBy('machineID').orderBy('datetime').rowsBetween(1-lag_n, 0)
    for col_name in rolling_features:
        tel_mean = tel_mean.withColumn(col_name+'_rollingmean_'+str(lag_n), 
                                       F.avg(col(col_name)).over(wSpec))
        tel_mean = tel_mean.withColumn(col_name+'_rollingstd_'+str(lag_n), 
                                       F.stddev(col(col_name)).over(wSpec))

# Align the rolling features on the 12-hour observation buckets...
telemetry_feat = (tel_mean.withColumn("dt_truncated", dt_truncated)
                  .drop('volt', 'rotate', 'pressure', 'vibration')
                  .fillna(0)
                  .groupBy("machineID","dt_truncated")
                  .agg(F.mean('volt_rollingmean_12').alias('volt_rollingmean_12'),
                       F.mean('rotate_rollingmean_12').alias('rotate_rollingmean_12'), 
                       F.mean('pressure_rollingmean_12').alias('pressure_rollingmean_12'), 
                       F.mean('vibration_rollingmean_12').alias('vibration_rollingmean_12'), 
                       F.mean('volt_rollingmean_24').alias('volt_rollingmean_24'),
                       F.mean('rotate_rollingmean_24').alias('rotate_rollingmean_24'), 
                       F.mean('pressure_rollingmean_24').alias('pressure_rollingmean_24'), 
                       F.mean('vibration_rollingmean_24').alias('vibration_rollingmean_24'),
                       F.mean('volt_rollingmean_36').alias('volt_rollingmean_36'),
                       F.mean('vibration_rollingmean_36').alias('vibration_rollingmean_36'),
                       F.mean('rotate_rollingmean_36').alias('rotate_rollingmean_36'), 
                       F.mean('pressure_rollingmean_36').alias('pressure_rollingmean_36'), 
                       F.stddev('volt_rollingstd_12').alias('volt_rollingstd_12'),
                       F.stddev('rotate_rollingstd_12').alias('rotate_rollingstd_12'), 
                       F.stddev('pressure_rollingstd_12').alias('pressure_rollingstd_12'), 
                       F.stddev('vibration_rollingstd_12').alias('vibration_rollingstd_12'), 
                       F.stddev('volt_rollingstd_24').alias('volt_rollingstd_24'),
                       F.stddev('rotate_rollingstd_24').alias('rotate_rollingstd_24'), 
                       F.stddev('pressure_rollingstd_24').alias('pressure_rollingstd_24'), 
                       F.stddev('vibration_rollingstd_24').alias('vibration_rollingstd_24'),
                       F.stddev('volt_rollingstd_36').alias('volt_rollingstd_36'),
                       F.stddev('rotate_rollingstd_36').alias('rotate_rollingstd_36'), 
                       F.stddev('pressure_rollingstd_36').alias('pressure_rollingstd_36'), 
                       F.stddev('vibration_rollingstd_36').alias('vibration_rollingstd_36'), ))

print(telemetry_feat.count())
telemetry_feat.where((col("machineID") == 1)).limit(10).toPandas().head(10)


731000
Out[9]:
machineID dt_truncated volt_rollingmean_12 rotate_rollingmean_12 pressure_rollingmean_12 vibration_rollingmean_12 volt_rollingmean_24 rotate_rollingmean_24 pressure_rollingmean_24 vibration_rollingmean_24 ... pressure_rollingstd_12 vibration_rollingstd_12 volt_rollingstd_24 rotate_rollingstd_24 pressure_rollingstd_24 vibration_rollingstd_24 volt_rollingstd_36 rotate_rollingstd_36 pressure_rollingstd_36 vibration_rollingstd_36
0 1 2015-04-02 12:00:00 172.270947 453.797903 100.483543 39.492142 171.338531 447.862397 100.398683 39.738176 ... 0.553654 0.759006 1.007532 3.219369 0.223788 0.157623 0.629986 0.722444 0.481365 0.088545
1 1 2015-05-12 00:00:00 171.869404 453.529162 100.214013 40.171795 171.649903 458.103907 101.813178 40.830119 ... 1.950677 0.954435 0.732750 2.007912 1.015522 0.472419 0.561696 0.988960 0.580570 0.306208
2 1 2015-07-26 12:00:00 167.325797 450.548728 96.380156 41.093641 168.417015 446.610707 97.754618 39.830570 ... 0.948548 0.460620 0.834651 1.777865 0.496723 0.181640 0.458351 1.382999 0.176032 0.107555
3 1 2015-08-04 00:00:00 170.231819 482.393571 98.709572 39.771546 171.876957 473.301415 103.358692 43.025796 ... 1.007066 0.634358 0.546013 1.609741 1.535208 0.333413 0.546630 1.585655 0.186535 0.133264
4 1 2015-08-23 12:00:00 170.203712 425.094036 101.529310 39.173365 167.913767 447.531697 100.026931 38.672964 ... 1.640248 0.397124 0.951870 4.648390 1.638073 0.297089 0.823227 2.742029 0.467837 0.189906
5 1 2015-12-17 00:00:00 169.167041 455.734514 98.842483 39.104152 171.370731 452.306194 98.278780 42.471398 ... 1.031176 0.801300 0.874356 2.523363 1.116471 0.800095 0.505470 2.272075 0.260917 0.302917
6 1 2015-02-16 12:00:00 167.881679 459.764609 95.996968 41.355194 173.957011 452.269318 99.137917 40.716012 ... 1.038664 0.734094 1.594118 4.929011 0.532804 0.139801 1.130919 3.430454 0.376233 0.134925
7 1 2015-10-30 12:00:00 171.302151 458.251722 100.603054 47.070461 169.921885 457.776886 98.524571 44.258167 ... 1.684044 1.263139 0.323506 3.108947 0.426392 0.751400 0.352569 1.753114 0.381185 0.762853
8 1 2015-10-31 12:00:00 172.983326 409.865155 96.193019 51.727068 170.658641 424.695800 96.454905 51.746030 ... 1.236143 0.363418 0.370740 11.858567 0.182346 0.205475 0.300758 6.649440 0.380280 0.680251
9 1 2015-12-11 12:00:00 173.352547 448.775731 101.412437 39.027765 171.292002 444.774327 101.879997 39.588702 ... 1.520909 0.524005 0.491613 2.037811 0.483161 0.292491 0.961163 2.095341 0.306276 0.286464

10 rows × 26 columns

Errors features

Like telemetry data, errors come with timestamps. An important difference is that the error IDs are categorical values and should not be averaged over time intervals like the telemetry measurements. Instead, we count the number of errors of each type within a lag window.

Again, we align the error counts on the 12-hour tumbling window by joining them with the telemetry data.


In [10]:
# create a column for each errorID 
error_ind = (errors.groupBy("machineID","datetime","errorID").pivot('errorID')
             .agg(F.count('machineID').alias('dummy')).drop('errorID').fillna(0)
             .groupBy("machineID","datetime")
             .agg(F.sum('error1').alias('error1sum'), 
                  F.sum('error2').alias('error2sum'), 
                  F.sum('error3').alias('error3sum'), 
                  F.sum('error4').alias('error4sum'), 
                  F.sum('error5').alias('error5sum')))

# join the telemetry data with errors
error_count = (telemetry.join(error_ind, 
                              ((telemetry['machineID'] == error_ind['machineID']) 
                               & (telemetry['datetime'] == error_ind['datetime'])), "left")
               .drop('volt', 'rotate', 'pressure', 'vibration')
               .drop(error_ind.machineID).drop(error_ind.datetime)
               .fillna(0))

error_features = ['error1sum','error2sum', 'error3sum', 'error4sum', 'error5sum']

wSpec = Window.partitionBy('machineID').orderBy('datetime').rowsBetween(1-24, 0)
for col_name in error_features:
    # We're only interested in the errors in the previous 24 hours.
    error_count = error_count.withColumn(col_name+'_rollingmean_24', 
                                         F.avg(col(col_name)).over(wSpec))

error_feat = (error_count.withColumn("dt_truncated", dt_truncated)
              .drop('error1sum', 'error2sum', 'error3sum', 'error4sum', 'error5sum').fillna(0)
              .groupBy("machineID","dt_truncated")
              .agg(F.mean('error1sum_rollingmean_24').alias('error1sum_rollingmean_24'), 
                   F.mean('error2sum_rollingmean_24').alias('error2sum_rollingmean_24'), 
                   F.mean('error3sum_rollingmean_24').alias('error3sum_rollingmean_24'), 
                   F.mean('error4sum_rollingmean_24').alias('error4sum_rollingmean_24'), 
                   F.mean('error5sum_rollingmean_24').alias('error5sum_rollingmean_24')))

print(error_feat.count())
error_feat.limit(10).toPandas().head(10)


731000
Out[10]:
machineID dt_truncated error1sum_rollingmean_24 error2sum_rollingmean_24 error3sum_rollingmean_24 error4sum_rollingmean_24 error5sum_rollingmean_24
0 26 2015-04-27 12:00:00 0.0 0.0 0.0 0.0 0.0
1 26 2015-06-28 12:00:00 0.0 0.0 0.0 0.0 0.0
2 26 2015-08-11 00:00:00 0.0 0.0 0.0 0.0 0.0
3 29 2015-03-04 00:00:00 0.0 0.0 0.0 0.0 0.0
4 29 2015-04-06 00:00:00 0.0 0.0 0.0 0.0 0.0
5 29 2015-05-14 12:00:00 0.0 0.0 0.0 0.0 0.0
6 29 2015-06-20 12:00:00 0.0 0.0 0.0 0.0 0.0
7 29 2015-10-16 00:00:00 0.0 0.0 0.0 0.0 0.0
8 474 2015-05-29 12:00:00 0.0 0.0 0.0 0.0 0.0
9 474 2015-09-15 00:00:00 0.0 0.0 0.0 0.0 0.0

Days since last replacement from maintenance

A crucial data set in this example is the maintenance records, which contain the information regarding component replacement. Possible features from this data set include the number of replacements of each component over time, or how long it has been since a component was last replaced. Replacement time is expected to correlate well with component failures, since the longer a component is in use, the more degradation we expect.

As a side note, creating lagging features from maintenance data is not straightforward. This type of ad-hoc feature engineering is very common in predictive maintenance, as domain knowledge plays a crucial role in understanding the predictors of a failure problem. In the following code blocks, the days since the last component replacement are calculated for each component from the maintenance data. We start by counting the component replacements for the set of machines.


In [11]:
# create a column for each component replacement
maint_replace = (maint.groupBy("machineID","datetime","comp").pivot('comp')
                 .agg(F.count('machineID').alias('dummy')).fillna(0)
                 .groupBy("machineID","datetime")
                 .agg(F.sum('comp1').alias('comp1sum'), 
                      F.sum('comp2').alias('comp2sum'), 
                      F.sum('comp3').alias('comp3sum'),
                      F.sum('comp4').alias('comp4sum')))

maint_replace = maint_replace.withColumnRenamed('datetime','datetime_maint')

print(maint_replace.count())
maint_replace.limit(10).toPandas().head(10)


25121
Out[11]:
machineID datetime_maint comp1sum comp2sum comp3sum comp4sum
0 567 2015-09-17 06:00:00 0 0 0 1
1 191 2015-12-28 06:00:00 1 0 0 0
2 301 2015-04-04 06:00:00 1 1 0 0
3 852 2015-06-14 06:00:00 0 0 1 0
4 942 2015-09-23 06:00:00 0 0 1 1
5 66 2015-09-30 06:00:00 0 0 1 0
6 370 2015-07-22 06:00:00 0 1 0 1
7 512 2015-05-03 06:00:00 0 0 1 0
8 427 2015-06-15 06:00:00 0 0 1 1
9 490 2015-10-26 06:00:00 1 1 0 0

Replacement features are then created by tracking the number of days since each component replacement. We'll repeat these calculations for each of the four components and join them together into a maintenance feature table.

First, component 1 (comp1):


In [12]:
# We want to align the component information on telemetry features timestamps.
telemetry_times = (telemetry_feat.select(telemetry_feat.machineID, telemetry_feat.dt_truncated)
                   .withColumnRenamed('dt_truncated','datetime_tel'))

# Grab component 1 records
maint_comp1 = (maint_replace.where(col("comp1sum") == '1').withColumnRenamed('datetime','datetime_maint')
               .drop('comp2sum', 'comp3sum', 'comp4sum'))

# Within each machine, get the last replacement date for each timepoint
maint_tel_comp1 = (telemetry_times.join(maint_comp1, 
                                        ((telemetry_times ['machineID']== maint_comp1['machineID']) 
                                         & (telemetry_times ['datetime_tel'] > maint_comp1['datetime_maint']) 
                                         & ( maint_comp1['comp1sum'] == '1')))
                   .drop(maint_comp1.machineID))

# Calculate the number of days between replacements
comp1 = (maint_tel_comp1.withColumn("sincelastcomp1", 
                                    datediff(maint_tel_comp1.datetime_tel, maint_tel_comp1.datetime_maint))
         .drop(maint_tel_comp1.datetime_maint).drop(maint_tel_comp1.comp1sum))

print(comp1.count())
comp1.filter(comp1.machineID == '625').orderBy(comp1.datetime_tel).limit(20).toPandas().head(20)


3254437
Out[12]:
machineID datetime_tel sincelastcomp1
0 625 2015-01-01 12:00:00 94
1 625 2015-01-02 00:00:00 95
2 625 2015-01-02 12:00:00 95
3 625 2015-01-03 00:00:00 96
4 625 2015-01-03 12:00:00 96
5 625 2015-01-04 00:00:00 97
6 625 2015-01-04 12:00:00 97
7 625 2015-01-05 00:00:00 98
8 625 2015-01-05 12:00:00 98
9 625 2015-01-06 00:00:00 99
10 625 2015-01-06 12:00:00 99
11 625 2015-01-07 00:00:00 100
12 625 2015-01-07 12:00:00 100
13 625 2015-01-08 00:00:00 101
14 625 2015-01-08 12:00:00 101
15 625 2015-01-09 00:00:00 102
16 625 2015-01-09 12:00:00 102
17 625 2015-01-10 00:00:00 103
18 625 2015-01-10 12:00:00 103
19 625 2015-01-11 00:00:00 104

Then component 2 (comp2):


In [13]:
# Grab component 2 records
maint_comp2 = (maint_replace.where(col("comp2sum") == '1').withColumnRenamed('datetime','datetime_maint')
               .drop('comp1sum', 'comp3sum', 'comp4sum'))

# Within each machine, get the last replacement date for each timepoint
maint_tel_comp2 = (telemetry_times.join(maint_comp2, 
                                        ((telemetry_times ['machineID']== maint_comp2['machineID']) 
                                         & (telemetry_times ['datetime_tel'] > maint_comp2['datetime_maint']) 
                                         & ( maint_comp2['comp2sum'] == '1')))
                   .drop(maint_comp2.machineID))

# Calculate the number of days between replacements
comp2 = (maint_tel_comp2.withColumn("sincelastcomp2", 
                                    datediff(maint_tel_comp2.datetime_tel, maint_tel_comp2.datetime_maint))
         .drop(maint_tel_comp2.datetime_maint).drop(maint_tel_comp2.comp2sum))

print(comp2.count())
comp2.filter(comp2.machineID == '625').orderBy(comp2.datetime_tel).limit(5).toPandas().head(5)


3278730
Out[13]:
machineID datetime_tel sincelastcomp2
0 625 2015-01-01 12:00:00 19
1 625 2015-01-02 00:00:00 20
2 625 2015-01-02 12:00:00 20
3 625 2015-01-03 00:00:00 21
4 625 2015-01-03 12:00:00 21

Then component 3 (comp3):


In [14]:
# Grab component 3 records
maint_comp3 = (maint_replace.where(col("comp3sum") == '1').withColumnRenamed('datetime','datetime_maint')
               .drop('comp1sum', 'comp2sum', 'comp4sum'))

# Within each machine, get the last replacement date for each timepoint
maint_tel_comp3 = (telemetry_times.join(maint_comp3, ((telemetry_times ['machineID']==maint_comp3['machineID']) 
                                                      & (telemetry_times ['datetime_tel'] > maint_comp3['datetime_maint']) 
                                                      & ( maint_comp3['comp3sum'] == '1')))
                   .drop(maint_comp3.machineID))

# Calculate the number of days between replacements
comp3 = (maint_tel_comp3.withColumn("sincelastcomp3", 
                                    datediff(maint_tel_comp3.datetime_tel, maint_tel_comp3.datetime_maint))
         .drop(maint_tel_comp3.datetime_maint).drop(maint_tel_comp3.comp3sum))


print(comp3.count())
comp3.filter(comp3.machineID == '625').orderBy(comp3.datetime_tel).limit(5).toPandas().head(5)


3345413
Out[14]:
machineID datetime_tel sincelastcomp3
0 625 2015-01-01 12:00:00 19
1 625 2015-01-02 00:00:00 20
2 625 2015-01-02 12:00:00 20
3 625 2015-01-03 00:00:00 21
4 625 2015-01-03 12:00:00 21

and component 4 (comp4):


In [15]:
# Grab component 4 records
maint_comp4 = (maint_replace.where(col("comp4sum") == '1').withColumnRenamed('datetime','datetime_maint')
               .drop('comp1sum', 'comp2sum', 'comp3sum'))

# Within each machine, get the last replacement date for each timepoint
maint_tel_comp4 = telemetry_times.join(maint_comp4, ((telemetry_times['machineID']==maint_comp4['machineID']) 
                                                     & (telemetry_times['datetime_tel'] > maint_comp4['datetime_maint']) 
                                                     & (maint_comp4['comp4sum'] == '1'))).drop(maint_comp4.machineID)

# Calculate the number of days between replacements
comp4 = (maint_tel_comp4.withColumn("sincelastcomp4", 
                                    datediff(maint_tel_comp4.datetime_tel, maint_tel_comp4.datetime_maint))
         .drop(maint_tel_comp4.datetime_maint).drop(maint_tel_comp4.comp4sum))

print(comp4.count())
comp4.filter(comp4.machineID == '625').orderBy(comp4.datetime_tel).limit(5).toPandas().head(5)


3273666
Out[15]:
machineID datetime_tel sincelastcomp4
0 625 2015-01-01 12:00:00 139
1 625 2015-01-02 00:00:00 140
2 625 2015-01-02 12:00:00 140
3 625 2015-01-03 00:00:00 141
4 625 2015-01-03 12:00:00 141

Now, we join the four component replacement tables together. Once joined, we align the data by averaging over the tumbling 12-hour observation windows.


In [16]:
# Join component 3 and 4
comp3_4 = (comp3.join(comp4, ((comp3['machineID'] == comp4['machineID']) 
                              & (comp3['datetime_tel'] == comp4['datetime_tel'])), "left")
           .drop(comp4.machineID).drop(comp4.datetime_tel))

# Join component 2 to 3 and 4
comp2_3_4 = (comp2.join(comp3_4, ((comp2['machineID'] == comp3_4['machineID']) 
                                  & (comp2['datetime_tel'] == comp3_4['datetime_tel'])), "left")
             .drop(comp3_4.machineID).drop(comp3_4.datetime_tel))

# Join component 1 to 2, 3 and 4
comps_feat = (comp1.join(comp2_3_4, ((comp1['machineID'] == comp2_3_4['machineID']) 
                                      & (comp1['datetime_tel'] == comp2_3_4['datetime_tel'])), "left")
               .drop(comp2_3_4.machineID).drop(comp2_3_4.datetime_tel)
               .groupBy("machineID", "datetime_tel")
               .agg(F.max('sincelastcomp1').alias('sincelastcomp1'), 
                    F.max('sincelastcomp2').alias('sincelastcomp2'), 
                    F.max('sincelastcomp3').alias('sincelastcomp3'), 
                    F.max('sincelastcomp4').alias('sincelastcomp4'))
               .fillna(0))

# Choose the time_val hour timestamps to align the data
dt_truncated = ((round(unix_timestamp(col("datetime_tel")) / time_val) * time_val).cast("timestamp"))

# Collect data
maint_feat = (comps_feat.withColumn("dt_truncated", dt_truncated)
              .groupBy("machineID","dt_truncated")
              .agg(F.mean('sincelastcomp1').alias('comp1sum'), 
                   F.mean('sincelastcomp2').alias('comp2sum'), 
                   F.mean('sincelastcomp3').alias('comp3sum'), 
                   F.mean('sincelastcomp4').alias('comp4sum')))

print(maint_feat.count())
maint_feat.limit(10).toPandas().head(10)


731000
Out[16]:
machineID dt_truncated comp1sum comp2sum comp3sum comp4sum
0 1 2015-04-02 12:00:00 200.0 200.0 140.0 275.0
1 1 2015-05-12 00:00:00 240.0 240.0 180.0 315.0
2 1 2015-07-26 12:00:00 315.0 315.0 255.0 390.0
3 1 2015-08-04 00:00:00 324.0 324.0 264.0 399.0
4 1 2015-08-23 12:00:00 343.0 343.0 283.0 418.0
5 1 2015-12-17 00:00:00 459.0 459.0 399.0 534.0
6 2 2015-05-31 00:00:00 334.0 199.0 229.0 304.0
7 2 2015-08-23 00:00:00 418.0 283.0 313.0 388.0
8 2 2015-09-14 12:00:00 440.0 305.0 335.0 410.0
9 3 2015-02-17 12:00:00 261.0 261.0 201.0 156.0

Machine features

The machine features capture the specifics of each individual machine. These can be used without further modification, since they include descriptive information about the type of each machine and its age (number of years in service). If the age information had been recorded as a "first use date" for each machine, a transformation would have been necessary to turn those dates into numeric values indicating the years in service.

We do need to create a set of dummy features, a set of boolean variables, to indicate the model of the machine. This can either be done manually or with a one-hot encoding step. We use one-hot encoding here for demonstration purposes.


In [17]:
# one hot encoding of the variable model, basically creates a set of dummy boolean variables
catVarNames = ['model']  
sIndexers = [StringIndexer(inputCol=x, outputCol=x + '_indexed') for x in catVarNames]
machines_cat = Pipeline(stages=sIndexers).fit(machines).transform(machines)

# one-hot encode
ohEncoders = [OneHotEncoder(inputCol=x + '_indexed', outputCol=x + '_encoded')
              for x in catVarNames]

ohPipelineModel = Pipeline(stages=ohEncoders).fit(machines_cat)
machines_cat = ohPipelineModel.transform(machines_cat)

drop_list = [col_n for col_n in machines_cat.columns if 'indexed' in col_n]

machines_feat = machines_cat.select([column for column in machines_cat.columns if column not in drop_list])

print(machines_feat.count())
machines_feat.limit(10).toPandas().head(10)


1000
Out[17]:
machineID model age model_encoded
0 1 model2 18 (0.0, 0.0, 1.0)
1 2 model4 7 (0.0, 1.0, 0.0)
2 3 model3 8 (1.0, 0.0, 0.0)
3 4 model3 7 (1.0, 0.0, 0.0)
4 5 model2 2 (0.0, 0.0, 1.0)
5 6 model3 7 (1.0, 0.0, 0.0)
6 7 model4 20 (0.0, 1.0, 0.0)
7 8 model3 16 (1.0, 0.0, 0.0)
8 9 model1 7 (0.0, 0.0, 0.0)
9 10 model1 10 (0.0, 0.0, 0.0)

Merging feature data

Next, we merge the telemetry, maintenance, machine and error feature data sets into a large feature data set. Since most of the data has already been aligned on the 12 hour observation period, we can merge with a simple join strategy.


In [18]:
# join error features with component maintenance features
error_maint = (error_feat.join(maint_feat, 
                               ((error_feat['machineID'] == maint_feat['machineID']) 
                                & (error_feat['dt_truncated'] == maint_feat['dt_truncated'])), "left")
               .drop(maint_feat.machineID).drop(maint_feat.dt_truncated))

# now join that with machines features
error_maint_feat = (error_maint.join(machines_feat, 
                                     ((error_maint['machineID'] == machines_feat['machineID'])), "left")
                    .drop(machines_feat.machineID))

# Clean up some unnecessary columns
error_maint_feat = error_maint_feat.select([c for c in error_maint_feat.columns if c not in 
                                            {'error1sum', 'error2sum', 'error3sum', 'error4sum', 'error5sum'}])

# join telemetry with error/maint/machine features to create final feature matrix
final_feat = (telemetry_feat.join(error_maint_feat, 
                                  ((telemetry_feat['machineID'] == error_maint_feat['machineID']) 
                                   & (telemetry_feat['dt_truncated'] == error_maint_feat['dt_truncated'])), "left")
              .drop(error_maint_feat.machineID).drop(error_maint_feat.dt_truncated))

print(final_feat.count())
final_feat.filter(final_feat.machineID == '625').orderBy(final_feat.dt_truncated).limit(10).toPandas().head(10)


731000
Out[18]:
machineID dt_truncated volt_rollingmean_12 rotate_rollingmean_12 pressure_rollingmean_12 vibration_rollingmean_12 volt_rollingmean_24 rotate_rollingmean_24 pressure_rollingmean_24 vibration_rollingmean_24 ... error3sum_rollingmean_24 error4sum_rollingmean_24 error5sum_rollingmean_24 comp1sum comp2sum comp3sum comp4sum model age model_encoded
0 625 2015-01-01 12:00:00 169.065806 453.899968 97.857385 44.903816 169.065806 453.899968 97.857385 44.903816 ... 0.0 0.0 0.0 94.0 19.0 19.0 139.0 model3 13 (1.0, 0.0, 0.0)
1 625 2015-01-02 00:00:00 166.187365 458.219143 95.377812 42.361593 166.267437 459.462370 96.064038 42.685859 ... 0.0 0.0 0.0 95.0 20.0 20.0 140.0 model3 13 (1.0, 0.0, 0.0)
2 625 2015-01-02 12:00:00 169.363503 455.143198 97.519219 41.000897 167.775434 456.681171 96.448515 41.681245 ... 0.0 0.0 0.0 95.0 20.0 20.0 140.0 model3 13 (1.0, 0.0, 0.0)
3 625 2015-01-03 00:00:00 172.504043 461.494330 101.483771 40.299350 170.933773 458.318764 99.501495 40.650123 ... 0.0 0.0 0.0 96.0 21.0 21.0 141.0 model3 13 (1.0, 0.0, 0.0)
4 625 2015-01-03 12:00:00 174.102964 442.074061 99.900129 39.624068 173.303503 451.784195 100.691950 39.961709 ... 0.0 0.0 0.0 96.0 21.0 21.0 141.0 model3 13 (1.0, 0.0, 0.0)
5 625 2015-01-04 00:00:00 172.353833 468.246837 102.862584 39.630433 173.228399 455.160449 101.381357 39.627250 ... 0.0 0.0 0.0 97.0 22.0 22.0 142.0 model3 13 (1.0, 0.0, 0.0)
6 625 2015-01-04 12:00:00 168.077031 450.096787 99.694484 40.363481 170.215432 459.171812 101.278534 39.996957 ... 0.0 0.0 0.0 97.0 22.0 22.0 142.0 model3 13 (1.0, 0.0, 0.0)
7 625 2015-01-05 00:00:00 170.473920 424.639863 99.638702 37.534512 169.275475 437.368325 99.666593 38.948996 ... 0.0 0.0 0.0 98.0 23.0 23.0 143.0 model3 13 (1.0, 0.0, 0.0)
8 625 2015-01-05 12:00:00 171.452495 440.234798 97.971502 38.377679 170.963208 432.437330 98.805102 37.956096 ... 0.0 0.0 0.0 98.0 23.0 23.0 143.0 model3 13 (1.0, 0.0, 0.0)
9 625 2015-01-06 00:00:00 168.265893 448.031539 101.101864 39.984267 169.859194 444.133169 99.536683 39.180973 ... 0.0 0.0 0.0 99.0 24.0 24.0 144.0 model3 13 (1.0, 0.0, 0.0)

10 rows × 38 columns

Label construction

Predictive maintenance is supervised learning. Training a model to predict failures requires examples of failures, together with the time series of observations leading up to those failures. Additionally, the model needs examples of periods of healthy operation in order to discern the difference between the two states. The classification between these states is typically a boolean label (healthy vs. failed).

Once we have the healthy vs. failure states, the predictive maintenance approach is only useful if the method gives some advance warning of an impending failure. To meet this early-warning criterion, we slightly modify the label definition from a failure event that occurs at a specific moment in time to a failure event that occurs anywhere within a longer window. The window length is defined by the business criteria. Is knowing that a failure will occur within 12 hours enough time to prevent the failure from happening? Is 24 hours, or 2 weeks? The ability of the model to accurately predict an impending failure depends on the sizing of this window. If the failure signal is short, longer windows will not help and can actually degrade the potential performance.

To achieve this redefinition of failure into "about to fail", we over-label the failure events, labeling all observations within the failure warning window as failed. The prediction problem then becomes estimating the probability of failure within this window.

For this example scenario, we estimate the probability that a machine will fail in the near future due to a failure of a certain component. More specifically, the goal is to compute the probability that a machine will fail in the next 7 days due to a component failure (component 1, 2, 3, or 4).

Below, a categorical failure feature is created to serve as the label. All records within the failure warning window before a failure of component 1 have failure="comp1", and so on for components 2, 3, and 4; all records outside any failure warning window have failure="none".

The first step is to align the failure data to the feature observation time points (every 12 hours).


In [19]:
dt_truncated = ((round(unix_timestamp(col("datetime")) / time_val) * time_val).cast("timestamp"))

fail_diff = (failures.withColumn("dt_truncated", dt_truncated)
             .drop(failures.datetime))

print(fail_diff.count())
fail_diff.limit(10).toPandas().head(10)


6368
Out[19]:
machineID failure dt_truncated
0 7 comp1 2015-07-31 12:00:00
1 179 comp4 2015-02-17 12:00:00
2 191 comp1 2015-12-28 12:00:00
3 221 comp2 2015-06-13 12:00:00
4 262 comp2 2015-06-28 12:00:00
5 288 comp2 2015-10-19 12:00:00
6 322 comp3 2015-11-27 12:00:00
7 346 comp4 2015-02-17 12:00:00
8 352 comp1 2015-11-27 12:00:00
9 382 comp3 2015-11-22 12:00:00

Next, we convert the labels from text to numeric values. In the end, this will transform the problem from a boolean 'healthy'/'impending failure' classification into a multiclass 'healthy'/'component n impending failure' classification.


In [20]:
# map the failure data to final feature matrix
labeled_features = (final_feat.join(fail_diff, 
                                    ((final_feat['machineID'] == fail_diff['machineID']) 
                                     & (final_feat['dt_truncated'] == fail_diff['dt_truncated'])), "left")
                    .drop(fail_diff.machineID).drop(fail_diff.dt_truncated)
                    .withColumn('failure', F.when(col('failure') == "comp1", 1.0).otherwise(col('failure')))
                    .withColumn('failure', F.when(col('failure') == "comp2", 2.0).otherwise(col('failure')))
                    .withColumn('failure', F.when(col('failure') == "comp3", 3.0).otherwise(col('failure')))
                    .withColumn('failure', F.when(col('failure') == "comp4", 4.0).otherwise(col('failure'))))

labeled_features = (labeled_features.withColumn("failure", 
                                                labeled_features.failure.cast(DoubleType()))
                    .fillna(0))

print(labeled_features.count())
labeled_features.limit(10).toPandas().head(10)


731000
Out[20]:
machineID dt_truncated volt_rollingmean_12 rotate_rollingmean_12 pressure_rollingmean_12 vibration_rollingmean_12 volt_rollingmean_24 rotate_rollingmean_24 pressure_rollingmean_24 vibration_rollingmean_24 ... error4sum_rollingmean_24 error5sum_rollingmean_24 comp1sum comp2sum comp3sum comp4sum model age model_encoded failure
0 1 2015-04-02 12:00:00 172.270947 453.797903 100.483543 39.492142 171.338531 447.862397 100.398683 39.738176 ... 0.0 0.0 200.0 200.0 140.0 275.0 model2 18 (0.0, 0.0, 1.0) 0.0
1 1 2015-05-12 00:00:00 171.869404 453.529162 100.214013 40.171795 171.649903 458.103907 101.813178 40.830119 ... 0.0 0.0 240.0 240.0 180.0 315.0 model2 18 (0.0, 0.0, 1.0) 0.0
2 1 2015-07-26 12:00:00 167.325797 450.548728 96.380156 41.093641 168.417015 446.610707 97.754618 39.830570 ... 0.0 0.0 315.0 315.0 255.0 390.0 model2 18 (0.0, 0.0, 1.0) 0.0
3 1 2015-08-04 00:00:00 170.231819 482.393571 98.709572 39.771546 171.876957 473.301415 103.358692 43.025796 ... 0.0 0.0 324.0 324.0 264.0 399.0 model2 18 (0.0, 0.0, 1.0) 0.0
4 1 2015-08-23 12:00:00 170.203712 425.094036 101.529310 39.173365 167.913767 447.531697 100.026931 38.672964 ... 0.0 0.0 343.0 343.0 283.0 418.0 model2 18 (0.0, 0.0, 1.0) 0.0
5 1 2015-12-17 00:00:00 169.167041 455.734514 98.842483 39.104152 171.370731 452.306194 98.278780 42.471398 ... 0.0 0.0 459.0 459.0 399.0 534.0 model2 18 (0.0, 0.0, 1.0) 0.0
6 2 2015-05-31 00:00:00 169.996558 382.645637 98.742043 41.084371 171.878438 394.960639 100.302806 45.069415 ... 0.0 0.0 334.0 199.0 229.0 304.0 model4 7 (0.0, 1.0, 0.0) 0.0
7 2 2015-08-23 00:00:00 171.318109 465.410280 95.714662 40.776288 170.363218 453.301829 98.142287 40.311255 ... 0.0 0.0 418.0 283.0 313.0 388.0 model4 7 (0.0, 1.0, 0.0) 0.0
8 2 2015-09-14 12:00:00 172.131803 446.534351 101.164512 40.007441 170.763787 445.208244 99.215403 39.185955 ... 0.0 0.0 440.0 305.0 335.0 410.0 model4 7 (0.0, 1.0, 0.0) 0.0
9 3 2015-02-17 12:00:00 168.963246 455.213179 103.499874 39.224858 172.191489 453.614534 102.485170 39.084057 ... 0.0 0.0 261.0 261.0 201.0 156.0 model3 8 (1.0, 0.0, 0.0) 0.0

10 rows × 39 columns

To verify we have assigned the component failure records correctly, we count the failure classes within the feature data.


In [21]:
# To get the frequency of each component failure 
df = labeled_features.select(labeled_features.failure).toPandas()
df['failure'].value_counts()


Out[21]:
0.0    724632
2.0      2467
1.0      1886
4.0      1103
3.0       912
Name: failure, dtype: int64

So far, we only have labels at the failure events themselves. To convert these into impending-failure labels, we over-label the observations preceding each failure (the previous 7 days) as failed.


In [22]:
# lag values to manually backfill label (bfill =7)
my_window = Window.partitionBy('machineID').orderBy(labeled_features.dt_truncated.desc())

# Create the previous 7 days 
labeled_features = (labeled_features.withColumn("prev_value1", 
                                                F.lag(labeled_features.failure).
                                                over(my_window)).fillna(0))
labeled_features = (labeled_features.withColumn("prev_value2", 
                                                F.lag(labeled_features.prev_value1).
                                                over(my_window)).fillna(0))
labeled_features = (labeled_features.withColumn("prev_value3", 
                                                F.lag(labeled_features.prev_value2).
                                                over(my_window)).fillna(0))
labeled_features = (labeled_features.withColumn("prev_value4", 
                                                F.lag(labeled_features.prev_value3).
                                                over(my_window)).fillna(0)) 
labeled_features = (labeled_features.withColumn("prev_value5", 
                                                F.lag(labeled_features.prev_value4).
                                                over(my_window)).fillna(0)) 
labeled_features = (labeled_features.withColumn("prev_value6", 
                                                F.lag(labeled_features.prev_value5).
                                                over(my_window)).fillna(0))
labeled_features = (labeled_features.withColumn("prev_value7", 
                                                F.lag(labeled_features.prev_value6).
                                                over(my_window)).fillna(0))

# Create the label feature by summing the failure value and the lagged values
labeled_features = (labeled_features.withColumn('label', labeled_features.failure + 
                                                labeled_features.prev_value1 +
                                                labeled_features.prev_value2 +
                                                labeled_features.prev_value3 +
                                                labeled_features.prev_value4 +
                                                labeled_features.prev_value5 + 
                                                labeled_features.prev_value6 + 
                                                labeled_features.prev_value7))

# Restrict the label to be on the range of 0:4, and remove extra columns
labeled_features = (labeled_features.withColumn('label_e', F.when(col('label') > 4, 4.0)
                                                .otherwise(col('label')))
                    .drop(labeled_features.prev_value1).drop(labeled_features.prev_value2)
                    .drop(labeled_features.prev_value3).drop(labeled_features.prev_value4)
                    .drop(labeled_features.prev_value5).drop(labeled_features.prev_value6)
                    .drop(labeled_features.prev_value7).drop(labeled_features.label))

print(labeled_features.count())
labeled_features.limit(10).toPandas().head(10)


731000
Out[22]:
machineID dt_truncated volt_rollingmean_12 rotate_rollingmean_12 pressure_rollingmean_12 vibration_rollingmean_12 volt_rollingmean_24 rotate_rollingmean_24 pressure_rollingmean_24 vibration_rollingmean_24 ... error5sum_rollingmean_24 comp1sum comp2sum comp3sum comp4sum model age model_encoded failure label_e
0 26 2016-01-01 12:00:00 164.275746 471.649368 127.520119 42.709311 166.077766 460.249765 125.377577 41.343086 ... 0.0 519.0 384.0 429.0 534.0 model3 3 (1.0, 0.0, 0.0) 0.0 0.0
1 26 2016-01-01 00:00:00 165.726644 456.042954 125.058244 41.195437 167.122468 451.586357 121.224495 40.970899 ... 0.0 519.0 384.0 429.0 534.0 model3 3 (1.0, 0.0, 0.0) 0.0 0.0
2 26 2015-12-31 12:00:00 168.518292 447.129760 117.390746 40.746361 167.497024 444.483544 107.757137 41.108126 ... 0.0 518.0 383.0 428.0 533.0 model3 3 (1.0, 0.0, 0.0) 0.0 0.0
3 26 2015-12-31 00:00:00 166.475756 441.837328 98.123529 41.469891 168.086449 439.266125 97.874147 41.675786 ... 0.0 518.0 383.0 428.0 533.0 model3 3 (1.0, 0.0, 0.0) 0.0 0.0
4 26 2015-12-30 12:00:00 169.697142 436.694922 97.624766 41.881681 170.345996 446.386919 100.487811 41.239698 ... 0.0 517.0 382.0 427.0 532.0 model3 3 (1.0, 0.0, 0.0) 0.0 0.0
5 26 2015-12-30 00:00:00 170.994850 456.078917 103.350856 40.597715 171.051625 448.379665 102.448812 39.829895 ... 0.0 517.0 382.0 427.0 532.0 model3 3 (1.0, 0.0, 0.0) 0.0 0.0
6 26 2015-12-29 12:00:00 171.108399 440.680414 101.546768 39.062075 169.303921 436.404334 102.623321 39.715629 ... 0.0 516.0 381.0 426.0 531.0 model3 3 (1.0, 0.0, 0.0) 0.0 0.0
7 26 2015-12-29 00:00:00 167.499443 432.128254 103.699874 40.369184 165.894769 446.484917 102.423560 40.815227 ... 0.0 516.0 381.0 426.0 531.0 model3 3 (1.0, 0.0, 0.0) 0.0 0.0
8 26 2015-12-28 12:00:00 164.290096 460.841581 101.147246 41.261270 167.538298 453.480096 100.551309 41.303765 ... 0.0 515.0 380.0 425.0 530.0 model3 3 (1.0, 0.0, 0.0) 0.0 0.0
9 26 2015-12-28 00:00:00 170.786501 446.118612 99.955372 41.346260 169.011702 451.989171 100.539022 41.103588 ... 0.0 515.0 380.0 425.0 530.0 model3 3 (1.0, 0.0, 0.0) 0.0 0.0

10 rows × 40 columns

To verify the label construction, we plot a sample of four machines over the data set lifetime. We expect the labels to cluster for each component, since there are 7-day windows of "fail". We have omitted the healthy labels, as they are uninformative. Since the labels are actually classes, the plot shows four distinct values on the y-axis.


In [23]:
plt_dta = (labeled_features.filter(labeled_features.label_e > 0)
           .where(col("machineID").isin({"65", "558", "222", "965"}))
           .select(labeled_features.machineID, labeled_features.dt_truncated, labeled_features.label_e)
           .toPandas())

# format datetime field which comes in as string
plt_dta['dt_truncated'] = pd.to_datetime(plt_dta['dt_truncated'], format="%Y-%m-%d %H:%M:%S")
plt_dta.label_e = plt_dta.label_e.astype(int)

ggplot(aes(x="dt_truncated", y="label_e", color="label_e"), plt_dta) +\
    geom_point()+\
    xlab("Date") + ylab("Component Number") +\
    scale_x_date(labels=date_format('%m-%d')) +\
    scale_color_brewer(type = 'seq', palette = 'BuGn') +\
    facet_grid('machineID')


Out[23]:
<ggplot: (-9223363302087609144)>

Here we see that most of the days are marked as healthy (label = 0 records are omitted for plot performance, though the dates are still accurate). Each of the four machines has multiple failures over the course of the dataset. Each labeled failure includes the date of failure and the previous seven days, all marked with the number indicating the component that failed.

The goal of the model will be to simultaneously predict when a failure will occur and which component will fail. This is a multiclass classification problem, though we could also pivot the data to individually predict the binary failure of each component instead of the machine as a whole, as sketched below.
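
As a minimal sketch of that alternative (not part of this solution's pipeline; the binary_comp1 and label_comp1 names are assumptions for illustration), a per-component binary label could be derived from the multiclass label_e column constructed above:

# Hypothetical binary target: 1.0 if component 1 is within its failure
# warning window, 0.0 otherwise. Repeat analogously for components 2-4.
binary_comp1 = labeled_features.withColumn(
    'label_comp1', F.when(col('label_e') == 1.0, 1.0).otherwise(0.0))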

Write the feature data to cloud storage

Write the final labeled feature data as a parquet file to an Azure blob storage container. For technical details, see: https://github.com/Azure/ViennaDocs/blob/master/Documentation/UsingBlobForStorage.md


In [24]:
# Create a new container if necessary, otherwise you can use an existing container.
# This command creates the container if it does not already exist. Else it does nothing.
az_blob_service.create_container(STORAGE_CONTAINER_NAME, 
                                 fail_on_exist=False, 
                                 public_access=PublicAccess.Container)

# Write labeled feature data to blob for use in the next notebook
labeled_features.write.mode('overwrite').parquet(FEATURES_LOCAL_DIRECT)

# Delete the old data.
for blob in az_blob_service.list_blobs(STORAGE_CONTAINER_NAME):
    if FEATURES_LOCAL_DIRECT in blob.name:
        az_blob_service.delete_blob(STORAGE_CONTAINER_NAME, blob.name)

# upload the entire folder into blob storage
for name in glob.iglob(FEATURES_LOCAL_DIRECT + '/*'):
    print(os.path.abspath(name))
    az_blob_service.create_blob_from_path(STORAGE_CONTAINER_NAME, name, name)

print("Feature engineering final dataset files saved!")

# Time the notebook execution. 
# This will only make sense if you "Run All" cells
toc = time.time()
print("Full run took %.2f minutes" % ((toc - tic)/60))

logger.log("Feature Engineering Run time", ((toc - tic)/60))


/azureml-run/featureengineering_files.parquet/part-00107-b3dd370b-0440-49fa-aaf9-dd0dad1762dc-c000.snappy.parquet
/azureml-run/featureengineering_files.parquet/part-00009-b3dd370b-0440-49fa-aaf9-dd0dad1762dc-c000.snappy.parquet
/azureml-run/featureengineering_files.parquet/part-00120-b3dd370b-0440-49fa-aaf9-dd0dad1762dc-c000.snappy.parquet
... (one line per uploaded file: parquet part files part-00000 through part-00199, plus the _SUCCESS marker) ...
/azureml-run/featureengineering_files.parquet/part-00135-b3dd370b-0440-49fa-aaf9-dd0dad1762dc-c000.snappy.parquet
Feature engineering final dataset files saved!
Full run took 22.44 minutes
Out[24]:
<azureml.logging.script_run_request.ScriptRunRequest at 0x7f1b731572b0>
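
As an optional sanity check, the short sketch below lists the blobs that now sit under the feature folder name in the container and counts them. This is an illustrative addition rather than part of the original notebook; it assumes the az_blob_service, STORAGE_CONTAINER_NAME, and FEATURES_LOCAL_DIRECT variables defined in the cells above are still in scope.

# Optional sanity check: confirm the parquet files landed in the container.
# Assumes az_blob_service, STORAGE_CONTAINER_NAME and FEATURES_LOCAL_DIRECT
# (defined in the cells above) are still in scope.
uploaded = [blob.name for blob in az_blob_service.list_blobs(STORAGE_CONTAINER_NAME)
            if FEATURES_LOCAL_DIRECT in blob.name]
print("Found %d feature files in container '%s'" % (len(uploaded), STORAGE_CONTAINER_NAME))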

Conclusion

The next step is to build and compare machine learning models using the feature data set we have just created. The Code/3_model_building.ipynb notebook works through building a Decision Tree Classifier and a Random Forest Classifier using this data set.
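
For reference, here is a minimal sketch of how the saved features could be read back in that notebook: re-create the blob client, download the parquet part files into a local folder, and load them with Spark. This is illustrative only; the variable values below are placeholders that must match the account, container, and folder names used above, and the actual loading code in Code/3_model_building.ipynb may differ.

# Illustrative sketch only -- the real loading code lives in Code/3_model_building.ipynb.
import os
from azure.storage.blob import BlockBlobService
from pyspark.sql import SparkSession

ACCOUNT_NAME = "<your blob storage account name>"           # same storage account as above
ACCOUNT_KEY = "<your blob storage account key>"
STORAGE_CONTAINER_NAME = "<same container name as above>"
FEATURES_LOCAL_DIRECT = "featureengineering_files.parquet"  # assumed from the upload paths printed above

spark = SparkSession.builder.getOrCreate()
az_blob_service = BlockBlobService(account_name=ACCOUNT_NAME, account_key=ACCOUNT_KEY)

# Download every blob that belongs to the feature data set into a local folder.
os.makedirs(FEATURES_LOCAL_DIRECT, exist_ok=True)
for blob in az_blob_service.list_blobs(STORAGE_CONTAINER_NAME):
    if FEATURES_LOCAL_DIRECT in blob.name:
        az_blob_service.get_blob_to_path(STORAGE_CONTAINER_NAME, blob.name, blob.name)

# Load the labeled feature data set back into Spark for model building.
labeled_features = spark.read.parquet(FEATURES_LOCAL_DIRECT)
print(labeled_features.count())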