Predicting delayed flights with classification analysis

This is a companion Jupyter notebook to the documentation example.

Let’s try to predict whether a flight will be delayed or not by using the sample flight data. We want to be able to use information such as weather conditions, carrier, flight distance, origin, or destination to predict flight delays. There are only two possible outcome values: the flight is either delayed or not, therefore we use binary classification to make the prediction.

We have chosen this dataset as an example because it is easily accessible for Kibana users and the use case is relevant. However, the data has been manually created and contains some inconsistencies. For example, a flight can be both delayed and canceled. Please remember that the quality of your input data will affect the quality of results.

Each document in the dataset contains details for a single flight, so this data is ready for analysis as it is already in a two-dimensional entity-based data structure (data frame). In general, you often need to transform the data into an entity-centric index before you analyze the data.



In [7]:

    
## imports
import pprint

from elasticsearch import Elasticsearch
import requests
## create a client to connect to Elasticsearch
es_url = 'http://localhost:9200'
es_client = Elasticsearch()

Example Document



In [25]:

    
## insert example of reading docs from ES index

results = es_client.search(index='kibana_sample_data_flights', filter_path=['hits.hits._*'], size=1)
results









    Out[25]:





{'hits': {'hits': [{'_index': 'kibana_sample_data_flights',
    '_id': 'HZL81W4BszKtAdTQ9e-h',
    '_score': 1.0,
    '_source': {'FlightNum': '9HY9SWR',
     'DestCountry': 'AU',
     'OriginWeather': 'Sunny',
     'OriginCityName': 'Frankfurt am Main',
     'AvgTicketPrice': 841.2656419677076,
     'DistanceMiles': 10247.856675613455,
     'FlightDelay': False,
     'DestWeather': 'Rain',
     'Dest': 'Sydney Kingsford Smith International Airport',
     'FlightDelayType': 'No Delay',
     'OriginCountry': 'DE',
     'dayOfWeek': 0,
     'DistanceKilometers': 16492.32665375846,
     'timestamp': '2019-11-25T00:00:00',
     'DestLocation': {'lat': '-33.94609833', 'lon': '151.177002'},
     'DestAirportID': 'SYD',
     'Carrier': 'Kibana Airlines',
     'Cancelled': False,
     'FlightTimeMin': 1030.7704158599038,
     'Origin': 'Frankfurt am Main Airport',
     'OriginLocation': {'lat': '50.033333', 'lon': '8.570556'},
     'DestRegion': 'SE-BD',
     'OriginAirportID': 'FRA',
     'OriginRegion': 'DE-HE',
     'DestCityName': 'Sydney',
     'FlightTimeHour': 17.179506930998397,
     'FlightDelayMin': 0}}]}}

Notice that each document contains a FlightDelay field with a boolean value. Classification is a supervised machine learning analysis and therefore needs to train on data that contains the ground truth, known as the dependent_variable. In this example, the ground truth is available in each document as the actual value of FlightDelay. In order to be analyzed, a document must contain at least one field with a supported data type (numeric, boolean, text, keyword or ip) and must not contain arrays with more than one item.

If your source data consists of some documents that contain a dependent variable and some that do not, the model is trained on the subset of documents that contain ground truth. By default, all of that subset of documents is used for training. However, you can choose to specify a percentage of the documents as your training data. Predictions are made against all of the data. The current implementation of classification analysis supports a single batch analysis for both training and predictions.

Creating a Classification Model



In [31]:

    
# 1. Creating a classification job 

endpoint_url = "/_ml/data_frame/analytics/model-flight-delay-classification"

job_config = {
  "source": {
    "index": [
      "kibana_sample_data_flights"  
    ]
  },
  "dest": {
    "index": "df-flight-delayed",  
    "results_field": "ml" 
  },
  "analysis": {
    "classification": {
      "dependent_variable": "FlightDelay",  
      "training_percent": 10  # see comment below on training percent
    }
  },
  "analyzed_fields": {
    "includes": [],
    "excludes": [    
      "Cancelled",
      "FlightDelayMin",
      "FlightDelayType"
    ]
  },
  "model_memory_limit": "100mb"}

result = requests.put(es_url+endpoint_url, json=job_config)
pprint.pprint(result.json())









    



{'allow_lazy_start': False,
 'analysis': {'classification': {'dependent_variable': 'FlightDelay',
                                 'num_top_classes': 2,
                                 'prediction_field_name': 'FlightDelay_prediction',
                                 'training_percent': 10.0}},
 'analyzed_fields': {'excludes': ['Cancelled',
                                  'FlightDelayMin',
                                  'FlightDelayType'],
                     'includes': []},
 'create_time': 1576584651508,
 'dest': {'index': 'df-flight-delayed', 'results_field': 'ml'},
 'id': 'model-flight-delay-classification',
 'model_memory_limit': '100mb',
 'source': {'index': ['kibana_sample_data_flights'],
            'query': {'match_all': {}}},
 'version': '8.0.0'}

A brief note on training percentage

As you may have noticed, in the job configuration above we set the value of training_percent to 10. This means that out of the whole Flights dataset 10 percent of the data will be used to train model and the remaining 90 percent of the data will be used for testing the model. You might wonder at this point, what is the best percentage for the train/test split and how you should choose what percentage to use in your own job? The answer will usually depend on your particular situation. In general it is useful to consider some of the following tradeoffs. The more data you supply to the model at training time, the more examples the model will have to learn from, which usually leads to a better classification performance. However, more training data will also increase the training time of the model and at some point, providing the model with more training examples will only result in marginal increase in accuracy.

Moreover, the more data you use for training, the less data you have for the testing phase. This means that you will have less previously unseen examples to show your model and thus perhaps your estimate for the generalization error will not be as accurate.

In general, for datasets containing several thousand docs or more, start with a low 5-10% training percentage and see how your results and runtime evolve as you increase the training percentage.



In [33]:

    
# 2. Start the job

start_endpoint = "/_ml/data_frame/analytics/model-flight-delay-classification/_start"
result = requests.post(es_url+start_endpoint)
pprint.pprint(result.json())









    



{'acknowledged': True}

The job takes a few minutes to run. Runtime depends on the local hardware and also on the number of documents and fields that are analyzed. The more fields and documents, the longer the job runs.



In [13]:

    
# 3. Check the job stats

stats_endpoint = "/_ml/data_frame/analytics/model-flight-delay-classification/_stats"
result = requests.get(es_url+stats_endpoint)
pprint.pprint(result.json())









    



{'count': 1,
 'data_frame_analytics': [{'id': 'model-flight-delay-classification',
                           'progress': [{'phase': 'reindexing',
                                         'progress_percent': 100},
                                        {'phase': 'loading_data',
                                         'progress_percent': 100},
                                        {'phase': 'analyzing',
                                         'progress_percent': 100},
                                        {'phase': 'writing_results',
                                         'progress_percent': 100}],
                           'state': 'stopped'}]}

View Classification Results



In [30]:

    
# insert code to get results
query = {"query": {"term": {"ml.is_training": {"value": False }}}}
result = es_client.search(index='df-flight-delayed', filter_path=['hits.hits._*'], size=1, body=query)
result









    Out[30]:





{'hits': {'hits': [{'_index': 'df-flight-delayed',
    '_id': '-5L81W4BszKtAdTQ-Pdn',
    '_score': 0.10778817,
    '_source': {'FlightNum': 'YHH7FJ3',
     'Origin': 'Rochester International Airport',
     'OriginLocation': {'lon': '-92.5', 'lat': '43.90829849'},
     'DestLocation': {'lon': '-117.5339966', 'lat': '47.61989975'},
     'FlightDelay': False,
     'DistanceMiles': 1231.1973824768306,
     'FlightTimeMin': 94.35333906213299,
     'OriginWeather': 'Clear',
     'dayOfWeek': 0,
     'AvgTicketPrice': 126.52980202058899,
     'Carrier': 'Logstash Airways',
     'FlightDelayMin': 0,
     'OriginRegion': 'US-MN',
     'FlightDelayType': 'No Delay',
     'DestAirportID': 'GEG',
     'timestamp': '2019-12-02T12:16:06',
     'Dest': 'Spokane International Airport',
     'FlightTimeHour': 1.5725556510355498,
     'Cancelled': False,
     'DistanceKilometers': 1981.4201203047928,
     'OriginCityName': 'Rochester',
     'DestWeather': 'Cloudy',
     'OriginCountry': 'US',
     'ml__id_copy': '-5L81W4BszKtAdTQ-Pdn',
     'DestCountry': 'US',
     'DestRegion': 'US-WA',
     'OriginAirportID': 'RST',
     'DestCityName': 'Spokane',
     'ml': {'top_classes': [{'class_probability': 0.9876257877232794,
        'class_name': 'false'},
       {'class_probability': 0.012374212276720642, 'class_name': 'true'}],
      'FlightDelay_prediction': 'false',
      'prediction_probability': 0.9876257877232794,
      'is_training': False}}}]}}

The example above shows that the analysis has predicted the probability of all possible classes. In this case, there are two classes: true and false. The class names along with the probability of the given classes are displayed in the top_classes object. The most probable class is the prediction. In the example above, true has a class_probability of 0.92 while false has only 0.08, so the prediction will be true which coincides with the ground truth contained by the FlightDelay field. The class probability values help you understand how sure the model is about the prediction. The higher number means that the model is more confident.

Evaluating Results

The results can be evaluated for documents which contain both the ground truth field and the prediction. In the example below, FlightDelay contains the ground truth and the prediction is stored as FlightDelay_prediction.

We use the data frame analytics evaluate API to evaluate the results. First, we want to know the training error that represents how well the model performed on the training dataset. In the previous step, we saw that the new index contained a field that indicated which documents were used as training data, which we can now use to calculate the training error:



In [18]:

    
# compute the training error

evaluate_endpoint = "/_ml/data_frame/_evaluate"

config = {
 "index": "df-flight-delayed",  
   "query": {
    "term": {
      "ml.is_training": {
        "value": True  
      }
    }
  },
 "evaluation": {
   "classification": {
     "actual_field": "FlightDelay",  
     "predicted_field": "ml.FlightDelay_prediction",  
     "metrics": {
       "multiclass_confusion_matrix" : {}
     }
   }
 }
}

result = requests.post(es_url+evaluate_endpoint, json=config)
result.json()









    Out[18]:





{'classification': {'multiclass_confusion_matrix': {'confusion_matrix': [{'actual_class': 'false',
     'actual_class_doc_count': 1000,
     'predicted_classes': [{'predicted_class': 'false', 'count': 904},
      {'predicted_class': 'true', 'count': 96}],
     'other_predicted_class_doc_count': 0},
    {'actual_class': 'true',
     'actual_class_doc_count': 334,
     'predicted_classes': [{'predicted_class': 'false', 'count': 20},
      {'predicted_class': 'true', 'count': 314}],
     'other_predicted_class_doc_count': 0}],
   'other_actual_class_count': 0}}}

Next, we calculate the generalization error that represents how well the model performed on previously unseen data. The returned confusion matrix shows us how many datapoints were classified correctly (where the actual_class matches the predicted_class) and how many were misclassified (actual_class does not match predicted_class):



In [19]:

    
# compute the generalization error

config = {
 "index": "df-flight-delayed",  
   "query": {
    "term": {
      "ml.is_training": {
        "value": False
      }
    }
  },
 "evaluation": {
   "classification": {
     "actual_field": "FlightDelay",  
     "predicted_field": "ml.FlightDelay_prediction",  
     "metrics": {
       "multiclass_confusion_matrix" : {}
     }
   }
 }
}

result = requests.post(es_url+evaluate_endpoint, json=config)
result.json()









    Out[19]:





{'classification': {'multiclass_confusion_matrix': {'confusion_matrix': [{'actual_class': 'false',
     'actual_class_doc_count': 8779,
     'predicted_classes': [{'predicted_class': 'false', 'count': 7176},
      {'predicted_class': 'true', 'count': 1603}],
     'other_predicted_class_doc_count': 0},
    {'actual_class': 'true',
     'actual_class_doc_count': 2946,
     'predicted_classes': [{'predicted_class': 'false', 'count': 963},
      {'predicted_class': 'true', 'count': 1983}],
     'other_predicted_class_doc_count': 0}],
   'other_actual_class_count': 0}}}