In [0]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Caution: In addition to Python packages, this notebook uses sudo apt-get install to install third-party packages.
This tutorial loads CoreDNS metrics from a Prometheus server into a tf.data.Dataset, then uses tf.keras for training and inference.
CoreDNS is a DNS server with a focus on service discovery, and is widely deployed as part of Kubernetes clusters. For that reason it is often closely monitored in devops operations.
This tutorial is an example for devops engineers looking to automate their operations through machine learning.
In [0]:
import os
In [3]:
try:
  %tensorflow_version 2.x
except Exception:
  pass
In [4]:
!pip install tensorflow-io
In [0]:
from datetime import datetime
import tensorflow as tf
import tensorflow_io as tfio
For demo purposes, set up a CoreDNS server locally with port 9053 open to receive DNS queries and port 9153 (default) open to expose metrics for scraping. The following is a basic Corefile configuration for CoreDNS, available for download:
.:9053 {
  prometheus
  whoami
}
More details about installation can be found in CoreDNS's documentation.
In [6]:
!curl -s -OL https://github.com/coredns/coredns/releases/download/v1.6.7/coredns_1.6.7_linux_amd64.tgz
!tar -xzf coredns_1.6.7_linux_amd64.tgz
!curl -s -OL https://raw.githubusercontent.com/tensorflow/io/master/docs/tutorials/prometheus/Corefile
!cat Corefile
In [0]:
# Run `./coredns` as a background process.
# IPython doesn't recognize `&` in inline bash cells.
get_ipython().system_raw('./coredns &')
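Optionally, you can verify from Python that CoreDNS is up before moving on by polling the metrics endpoint on port 9153. A minimal sketch, assuming the standard Prometheus /metrics path and that the server may need a moment to start:
In [0]:
from time import sleep
import urllib.request

# Poll the CoreDNS metrics endpoint until it responds (give up after ~10s).
for _ in range(10):
  try:
    with urllib.request.urlopen("http://localhost:9153/metrics") as resp:
      print("CoreDNS metrics endpoint is up (HTTP {})".format(resp.status))
      break
  except Exception:
    sleep(1)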
The next step is to set up a Prometheus server and use Prometheus to scrape the CoreDNS metrics that are exposed on port 9153 from above. The prometheus.yml file for configuration is also available for download:
In [8]:
!curl -s -OL https://github.com/prometheus/prometheus/releases/download/v2.15.2/prometheus-2.15.2.linux-amd64.tar.gz
!tar -xzf prometheus-2.15.2.linux-amd64.tar.gz --strip-components=1
!curl -s -OL https://raw.githubusercontent.com/tensorflow/io/master/docs/tutorials/prometheus/prometheus.yml
!cat prometheus.yml
In [0]:
# Run `./prometheus` as a background process.
# IPython doesn't recognize `&` in inline bash cells.
get_ipython().system_raw('./prometheus &')
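Similarly, the Prometheus HTTP API can confirm that the server is running and which targets it scrapes. A minimal sketch, assuming the standard /api/v1/targets endpoint on port 9090 (and that the server has had a moment to start):
In [0]:
import json
import urllib.request

# List the scrape targets Prometheus knows about and their current health.
with urllib.request.urlopen("http://localhost:9090/api/v1/targets") as resp:
  targets = json.loads(resp.read().decode())
for target in targets["data"]["activeTargets"]:
  print("{} ({}): {}".format(
      target["labels"]["job"], target["labels"]["instance"], target["health"]))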
To show some activity, the dig command can be used to generate a few DNS queries against the CoreDNS server that has been set up:
In [0]:
!sudo apt-get install -y -qq dnsutils
In [11]:
!dig @127.0.0.1 -p 9053 demo1.example.org
In [12]:
!dig @127.0.0.1 -p 9053 demo2.example.org
There is now a CoreDNS server whose metrics are scraped by a Prometheus server, ready to be consumed by TensorFlow.
Creating a Dataset for CoreDNS metrics that are available from the Prometheus server can be done with tfio.experimental.IODataset.from_prometheus. At a minimum, two arguments are needed: query is passed to the Prometheus server to select the metrics, and length is the period to load into the Dataset.
You can start with "coredns_dns_request_count_total" and "5" (secs) to create the Dataset below. Since two DNS queries were sent earlier in the tutorial, it is expected that the metric for "coredns_dns_request_count_total" will be "2.0" at the end of the time series:
In [13]:
dataset = tfio.experimental.IODataset.from_prometheus(
"coredns_dns_request_count_total", 5, endpoint="http://localhost:9090")
print("Dataset Spec:\n{}\n".format(dataset.element_spec))
print("CoreDNS Time Series:")
for (time, value) in dataset:
  # time is in milliseconds, convert to datetime:
  time = datetime.fromtimestamp(time // 1000)
  print("{}: {}".format(time, value['coredns']['localhost:9153']['coredns_dns_request_count_total']))
Looking further into the spec of the Dataset:
(
  TensorSpec(shape=(), dtype=tf.int64, name=None),
  {
    'coredns': {
      'localhost:9153': {
        'coredns_dns_request_count_total': TensorSpec(shape=(), dtype=tf.float64, name=None)
      }
    }
  }
)
The dataset consists of a (time, values) tuple, where the values field is a Python dict expanded into:
"job_name": {
"instance_name": {
"metric_name": value,
},
}
In the above example, 'coredns' is the job name, 'localhost:9153' is the instance name, and 'coredns_dns_request_count_total' is the metric name. Note that depending on the Prometheus query used, multiple jobs, instances, or metrics could be returned, which is why a Python dict is used in the structure of the Dataset.
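Since the nesting depth is fixed (job, then instance, then metric), one query-agnostic way to consume the values dict is a few nested loops; a minimal sketch over the first element of the Dataset created above:
In [0]:
# Walk the job -> instance -> metric structure of one (time, value) element.
for (time, value) in dataset.take(1):
  for job_name, instances in value.items():
    for instance_name, metrics in instances.items():
      for metric_name, metric_value in metrics.items():
        print("{}/{}/{}: {}".format(
            job_name, instance_name, metric_name, metric_value.numpy()))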
Take another query "go_memstats_gc_sys_bytes" as an example. Since both CoreDNS and Prometheus are written in Golang, the "go_memstats_gc_sys_bytes" metric is available for both the "coredns" job and the "prometheus" job:
Note: This cell may error out the first time you run it. Run it again and it will pass.
In [14]:
dataset = tfio.experimental.IODataset.from_prometheus(
"go_memstats_gc_sys_bytes", 5, endpoint="http://localhost:9090")
print("Time Series CoreDNS/Prometheus Comparision:")
for (time, value) in dataset:
# time is milli second, convert to data time:
time = datetime.fromtimestamp(time // 1000)
print("{}: {}/{}".format(
time,
value['coredns']['localhost:9153']['go_memstats_gc_sys_bytes'],
value['prometheus']['localhost:9090']['go_memstats_gc_sys_bytes']))
The created Dataset is now ready to be passed to tf.keras directly for either training or inference.
In [0]:
n_steps, n_features = 2, 1
simple_lstm_model = tf.keras.models.Sequential([
tf.keras.layers.LSTM(8, input_shape=(n_steps, n_features)),
tf.keras.layers.Dense(1)
])
simple_lstm_model.compile(optimizer='adam', loss='mae')
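Optionally, calling the untrained model on a dummy batch is a quick shape sanity check; a minimal sketch:
In [0]:
# A dummy batch of shape (batch, n_steps, n_features)
# should produce an output of shape (batch, 1).
print(simple_lstm_model(tf.zeros([1, n_steps, n_features])).shape)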
The dataset to be used is the value of 'go_memstats_sys_bytes' for CoreDNS, with 10 samples. However, since a sliding window of window=n_steps and shift=1 is formed, additional samples are needed (for any two consecutive elements, the first is taken as x and the second is taken as y for training). The total is 10 + n_steps - 1 + 1 = 12 seconds.
The data value is also scaled to [0, 1].
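Before applying this to the real metrics, the window/shift/zip pattern can be illustrated on a toy tf.data.Dataset.range, as a minimal sketch independent of Prometheus:
In [0]:
# Toy example: windows of size 2 with shift 1 over [0, 1, 2, 3],
# then zip offset-by-one windows into (x, y) training pairs.
toy = tf.data.Dataset.range(4)
toy = toy.window(2, shift=1, drop_remainder=True)
toy = toy.flat_map(lambda w: w.batch(2))
for a, b in tf.data.Dataset.zip((toy.take(2), toy.skip(1).take(2))):
  print("x = {}, y = {}".format(a.numpy(), b.numpy()))
# x = [0 1], y = [1 2]
# x = [1 2], y = [2 3]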
In [16]:
n_samples = 10
dataset = tfio.experimental.IODataset.from_prometheus(
"go_memstats_sys_bytes", n_samples + n_steps - 1 + 1, endpoint="http://localhost:9090")
# take go_memstats_sys_bytes from the coredns job
dataset = dataset.map(lambda _, v: v['coredns']['localhost:9153']['go_memstats_sys_bytes'])
# find the max value and scale the value to [0, 1]
v_max = dataset.reduce(tf.constant(0.0, tf.float64), tf.math.maximum)
dataset = dataset.map(lambda v: (v / v_max))
# expand the dimension by 1 to fit n_features=1
dataset = dataset.map(lambda v: tf.expand_dims(v, -1))
# take a sliding window
dataset = dataset.window(n_steps, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda d: d.batch(n_steps))
# the first value is x and the next value is y, only take 10 samples
x = dataset.take(n_samples)
y = dataset.skip(1).take(n_samples)
dataset = tf.data.Dataset.zip((x, y))
# pass the final dataset to model.fit for training
simple_lstm_model.fit(dataset.batch(1).repeat(10), epochs=5, steps_per_epoch=10)
The trained model above is not very useful in reality, as the CoreDNS server that has been set up in this tutorial does not have any workload. However, this is a working pipeline that could be used to load metrics from real production servers. The model could then be improved to solve the real-world problem of devops automation.
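For completeness, here is a minimal inference sketch with the trained model on the same zipped dataset; each prediction is compared against the last scaled value of the corresponding y window (with this toy workload the numbers are not meaningful):
In [0]:
# Predict the next scaled value for the first few input windows.
for x_window, y_window in dataset.batch(1).take(3):
  y_pred = simple_lstm_model(x_window)
  print("predicted: {:.4f}, actual: {:.4f}".format(
      float(y_pred[0, 0]), float(y_window[0, -1, 0])))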