In telco, accurate forecasting of KPIs (e.g. network traffic, utilization, user experience) for communication networks (2G/3G/4G/5G/wired) can help predict network failures, allocate resources, or save energy.
In this notebook, we demonstrate a reference use case where we use past network traffic KPI(s) to predict future traffic KPI(s). We demonstrate how to do univariate forecasting (predict only 1 series) and multivariate forecasting (predict more than 1 series at the same time) using Project Zouwu.
For demonstration, we use the publicly available network traffic data repository maintained by the WIDE project, in particular the network traffic traces aggregated every 2 hours (i.e. AvgRate in Mbps/Gbps and Total Bytes) in 2018 and 2019 at the transit link of WIDE to the upstream ISP (dataset link).
This section defines some helper functions used in the following procedures. You can refer back to them when they're used.
In [1]:
def get_drop_dates_and_len(df, target_col="total", allow_missing_num=3):
    """
    Find runs of consecutive missing values in target_col and return
    the start datetime and length of each run longer than allow_missing_num.
    """
    # group consecutive NaNs by the cumulative count of non-null values
    missing_num = df[target_col].isnull().astype(int).groupby(df[target_col].notnull().astype(int).cumsum()).sum()
    missing_num = missing_num[missing_num > 0]
    # locate the first timestamp of each missing run
    drop_datetimes = df.iloc[(missing_num.index + missing_num.cumsum() - missing_num).values].index
    drop_datetimes = drop_datetimes[missing_num > allow_missing_num]
    drop_missing_num = missing_num[missing_num > allow_missing_num]
    drop_len = drop_missing_num.values
    return drop_datetimes, drop_len
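To see how the cumsum-grouping trick inside this helper works, here is a toy sketch on synthetic data (not the real traffic KPIs): each run of consecutive NaNs gets its own group id, and the group sums are the run lengths.

```python
import numpy as np
import pandas as pd

# toy series: two runs of missing values (lengths 2 and 4)
idx = pd.date_range("2018-01-01", periods=10, freq="2h")
s = pd.Series([1.0, np.nan, np.nan, 2.0, 3.0,
               np.nan, np.nan, np.nan, np.nan, 4.0], index=idx)

# cumsum over notnull() stays constant inside a NaN run,
# so grouping by it isolates each run of consecutive NaNs
missing_num = s.isnull().astype(int).groupby(s.notnull().astype(int).cumsum()).sum()
missing_num = missing_num[missing_num > 0]
print(missing_num.values)  # [2 4]
```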
In [2]:
def rm_missing_weeks(start_dts, missing_lens, df):
    """
    Drop weeks that contain more than 3 consecutive missing values.
    If a run of missing values spans multiple weeks, all of those weeks are removed.
    """
    for start_time, l in zip(start_dts, missing_lens):
        # beginning of the week that contains the first missing value
        start = start_time - pd.Timedelta(days=start_time.dayofweek)
        start = start.replace(hour=0, minute=0, second=0)
        start_week_end = start + pd.Timedelta(days=6)
        start_week_end = start_week_end.replace(hour=22, minute=0, second=0)
        # end of the week that contains the last missing value
        end_time = start_time + l * pd.Timedelta(hours=2)
        if start_week_end < end_time:
            end = end_time + pd.Timedelta(days=6 - end_time.dayofweek)
            end = end.replace(hour=22, minute=0, second=0)
        else:
            end = start_week_end
        df = df.drop(df[start:end].index)
    return df
In [3]:
def gen_dataset_matrix(dataset, look_back, target_col_indexes):
    """
    Generate input samples and labels with a rolling window.
    """
    X, Y = [], []
    if len(target_col_indexes) == 1:
        # for univariate forecasting, drop the other KPI column
        del_col_index = {0, 1}.difference(set(target_col_indexes)).pop()
        data = np.delete(dataset, del_col_index, axis=1)
    else:
        data = dataset
    for i in range(len(data) - look_back):
        X.append(data[i: (i + look_back)])
        Y.append(data[i + look_back, target_col_indexes])
    return np.array(X), np.array(Y)
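As a sanity check on the rolling-window logic, here is a simplified toy run (it omits the column-deletion step used for the univariate case): an array of n rows yields n - look_back samples, each pairing look_back past rows with the next row's target column.

```python
import numpy as np

data = np.arange(12).reshape(6, 2)  # 6 timesteps, 2 features
look_back = 3
X, Y = [], []
for i in range(len(data) - look_back):
    X.append(data[i: i + look_back])    # past look_back rows as input
    Y.append(data[i + look_back, [0]])  # next row's target column as label
X, Y = np.array(X), np.array(Y)
print(X.shape, Y.shape)  # (3, 3, 2) (3, 1)
```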
In [4]:
def unscale(scaler, y, target_col_indexes):
    """
    Data needs to be normalized (scaled) before feeding into models.
    This inverses the effect of normalization to get reasonable forecast results.
    """
    dummy_feature_shape = scaler.scale_.shape[0]
    y_dummy = np.zeros((y.shape[0], dummy_feature_shape))
    y_dummy[:, target_col_indexes] = y
    y_unscale = scaler.inverse_transform(y_dummy)[:, target_col_indexes]
    return y_unscale
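The dummy-matrix trick above lets us invert the scaling of only the target column(s), since the scaler was fitted on all features. A toy round-trip on synthetic data confirms the idea:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy: 3 features, target is column 0
rng = np.random.RandomState(0)
data = rng.rand(100, 3) * 10
scaler = StandardScaler().fit(data)

# scale, then invert only the target column via a zero-padded dummy matrix
y_scaled = scaler.transform(data)[:, [0]]
y_dummy = np.zeros((y_scaled.shape[0], scaler.scale_.shape[0]))
y_dummy[:, [0]] = y_scaled
y_back = scaler.inverse_transform(y_dummy)[:, [0]]
print(np.allclose(y_back, data[:, [0]]))  # True
```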
In [5]:
EPSILON = 1e-10
def sMAPE(y_true, y_pred, multioutput="uniform_average"):
    """
    Symmetric Mean Absolute Percentage Error
    """
    if multioutput not in ["raw_values", "uniform_average"]:
        raise ValueError("multioutput must be 'raw_values' or 'uniform_average', got {}".format(multioutput))
    output_errors = np.mean(100 * np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred) + EPSILON), axis=0)
    if multioutput == "raw_values":
        return output_errors
    return np.mean(output_errors)
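A quick worked example of the sMAPE formula on two toy points: each point contributes 100 * |t - p| / (|t| + |p|), then the mean is taken.

```python
import numpy as np

EPSILON = 1e-10
y_true = np.array([100.0, 200.0])
y_pred = np.array([110.0, 180.0])

# per-point sMAPE: 100 * 10/210 ~ 4.76 and 100 * 20/380 ~ 5.26
errors = 100 * np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred) + EPSILON)
print(errors)           # approx [4.7619 5.2632]
print(np.mean(errors))  # approx 5.0125
```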
In [6]:
def plot_predict_actual_values(date, y_pred, y_test, ylabel):
    """
    Plot the predicted values and actual values (for the test data)
    """
    fig, axs = plt.subplots(figsize=(16, 6))
    axs.plot(date, y_pred, color='red', label='predicted values')
    axs.plot(date, y_test, color='blue', label='actual values')
    axs.set_title('the predicted values and actual values (for the test data)')
    plt.xlabel('test datetime')
    plt.ylabel(ylabel)
    plt.legend(loc='upper left')
    plt.show()
Now we download the dataset and load it into a pandas dataframe. The steps are as below.
First, run the script get_data.sh to download the raw data. It will download the monthly aggregated traffic data for 2018 and 2019 into the data folder. The raw data contains aggregated network traffic (average rate in Mbps and total bytes) as well as other metrics.
Second, run extract_data.sh to extract the relevant traffic KPIs from the raw data, i.e. AvgRate for the average rate and total for total bytes. The script will extract the KPIs with timestamps into data/data.csv.
Finally, use pandas to load data/data.csv into a dataframe as shown below.
In [7]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
In [8]:
raw_df = pd.read_csv("data/data.csv")
Below are some example records of the data.
In [9]:
raw_df.head()
Out[9]:
Now we need to do data cleaning and preprocessing on the raw data. Note that this part can vary for different datasets.
For the network traffic data we're using, the preprocessing contains 3 parts:
1. Convert the StartTime column to Pandas datetime
2. Unify the AvgRate value - some records use Mbps, some use Gbps
3. Handle missing data
In [10]:
df = pd.DataFrame(pd.to_datetime(raw_df.StartTime))
In [11]:
# we can find 'AvgRate' is of two scales: 'Mbps' and 'Gbps'
raw_df.AvgRate.str[-4:].unique()
Out[11]:
In [12]:
# Unify AvgRate value
df['AvgRate'] = raw_df.AvgRate.apply(lambda x: float(x[:-4]) if x.endswith("Mbps") else float(x[:-4]) * 1000)
In [13]:
df["total"] = raw_df["total"]
df.set_index("StartTime", inplace=True)
In [14]:
df.head()
Out[14]:
In [15]:
full_idx = pd.date_range(start=df.index.min(), end=df.index.max(), freq='2H')
df = df.reindex(full_idx)
print("no. of n/a values:")
print(df.isna().sum())
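Reindexing to a complete 2-hour range is what exposes the gaps as NaN rows in the first place. A toy illustration of the same pattern:

```python
import numpy as np
import pandas as pd

# toy series with two timestamps missing from the 2-hour grid
idx = pd.to_datetime(["2018-01-01 00:00", "2018-01-01 02:00", "2018-01-01 08:00"])
s = pd.Series([1.0, 2.0, 3.0], index=idx)

full_idx = pd.date_range(start=s.index.min(), end=s.index.max(), freq="2h")
s = s.reindex(full_idx)  # the absent timestamps (04:00, 06:00) become NaN
print(s.isna().sum())  # 2
```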
Here, we drop weeks with more than 3 consecutive missing values and fill the remaining missing values.
In [16]:
drop_dts, drop_len = get_drop_dates_and_len(df)
df = rm_missing_weeks(drop_dts, drop_len, df)
In [17]:
df.ffill(inplace=True)
Plot the data to see what the KPIs look like.
In [18]:
ax = df.plot(y='AvgRate',figsize=(16,6), title="AvgRate of network traffic data")
In [19]:
ax = df.plot(y='total',figsize=(16,6), title="total bytes of network traffic data")
For feature engineering, we use year, month, week, day of week, and hour as features in addition to the target KPI values.
In [20]:
df["year"] = df.index.year
df["month"] = df.index.month
df["week"] = df.index.week
df["dayofweek"] = df.index.dayofweek
df["hour"] = df.index.hour
In [21]:
df.head()
Out[21]:
Now we split the dataset into train and test.
In [22]:
test_num = 1000
# we look back one week of data; at a 2-hour frequency, one week is 84 points
look_back = 84
test_split_index = test_num + look_back
target_dim = 1
train_df = df[:-test_num]
test_df = df[-test_split_index:]
test_df = test_df.reset_index(drop=True)
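The extra look_back rows kept at the head of the test split ensure that, after the rolling window is applied, exactly test_num test samples remain. A quick check of that arithmetic:

```python
# a rolling window of look_back rows predicting the next row
# turns a frame of n rows into n - look_back samples
test_num, look_back = 1000, 84
test_split_index = test_num + look_back
n_test_rows = test_split_index
n_test_samples = n_test_rows - look_back
print(n_test_samples)  # 1000
```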
Then we standardize the train and test data (targets and features).
In [23]:
from sklearn.preprocessing import StandardScaler
standard_scaler = StandardScaler()
scaled_train = standard_scaler.fit_transform(train_df)
scaled_test = standard_scaler.transform(test_df)
Last, we generate model input samples by sliding a window along the time axis. Since univariate and multivariate forecasting use different input data shapes, we prepare the data for each type of forecasting respectively.
First, we prepare the data for univariate forecasting, which predicts AvgRate only.
In [24]:
# for univariate
# here we take "AvgRate" as an example
uni_target_col_indexes = [0]
uni_target_value = "AvgRate"
x_train, y_train = gen_dataset_matrix(scaled_train, look_back, uni_target_col_indexes)
x_test, y_test = gen_dataset_matrix(scaled_test, look_back, uni_target_col_indexes)
x_train.shape, y_train.shape, x_test.shape, y_test.shape
Out[24]:
Then we prepare the data for multivariate forecasting, which predicts AvgRate and total at the same time.
In [25]:
# for multivariate
multi_target_col_indexes = [0, 1]
multi_target_value = ["AvgRate","total"]
x_train_m, y_train_m = gen_dataset_matrix(scaled_train, look_back, multi_target_col_indexes)
x_test_m, y_test_m = gen_dataset_matrix(scaled_test, look_back, multi_target_col_indexes)
x_train_m.shape, y_train_m.shape, x_test_m.shape, y_test_m.shape
Out[25]:
For univariate forecasting, we use LSTMForecaster.
In [26]:
from zoo.zouwu.model.forecast import LSTMForecaster
First we instantiate an LSTMForecaster. feature_dim should match the input features of the training data, so we use the last dimension of the training data shape. target_dim equals the number of variates we want to predict; we set target_dim=1 for univariate forecasting.
In [27]:
# build model
lstm_config = {"lstm_1_units":32, "lstm_2_units":32, "lr":0.001}
forecaster = LSTMForecaster(target_dim=1, feature_dim=x_train.shape[-1], **lstm_config)
Then we use fit to train the model. Wait some time for it to finish.
In [28]:
%%time
forecaster.fit(x=x_train, y=y_train, batch_size=1024, epochs=50, distributed=False)
After training is finished, you can use the forecaster to do prediction and evaluation.
In [29]:
# make prediction
y_pred = forecaster.predict(x_test)
Since we have used standard scaler to scale the input data (including the target values), we need to inverse the scaling on the predicted values too.
In [30]:
y_pred_unscale = unscale(standard_scaler, y_pred, uni_target_col_indexes)
y_test_unscale = unscale(standard_scaler, y_test, uni_target_col_indexes)
Calculate the symmetric mean absolute percentage error (sMAPE).
In [31]:
# evaluate with sMAPE
print("sMAPE is", sMAPE(y_test_unscale, y_pred_unscale))
# evaluate with mean_squared_error
from sklearn.metrics import mean_squared_error
print("mean_squared error is", mean_squared_error(y_test_unscale, y_pred_unscale))
For multivariate forecasting, we use MTNetForecaster.
In [32]:
from zoo.zouwu.model.forecast import MTNetForecaster
First, we initialize an mtnet_forecaster according to the input data shape. The look-back length must equal (long_series_num + 1) * series_length; here (6 + 1) * 12 = 84, which matches look_back. For details, refer to the zouwu docs.
In [33]:
mtnet_forecaster = MTNetForecaster(target_dim=y_train_m.shape[-1],
feature_dim=x_train_m.shape[-1],
long_series_num=6,
series_length=12,
ar_window_size=6,
cnn_height=4
)
MTNet needs its input X in a different format, so we call MTNetForecaster.preprocess_input on x_train_m and x_test_m.
In [34]:
# mtnet requires reshape of input x before feeding into model.
x_train_mtnet = mtnet_forecaster.preprocess_input(x_train_m)
x_test_mtnet = mtnet_forecaster.preprocess_input(x_test_m)
Now we train the model and wait until it finishes.
In [35]:
%%time
hist = mtnet_forecaster.fit(x=x_train_mtnet, y=y_train_m, batch_size=1024, epochs=20)
Use the model for prediction and inverse the scaling of the prediction results.
In [36]:
y_pred_m = mtnet_forecaster.predict(x_test_mtnet)
In [37]:
y_pred_m_unscale = unscale(standard_scaler, y_pred_m, multi_target_col_indexes)
y_test_m_unscale = unscale(standard_scaler, y_test_m, multi_target_col_indexes)
In [38]:
# evaluate with sMAPE
print("sMAPE is",sMAPE(y_pred_m_unscale, y_test_m_unscale, multioutput="raw_values"))
# evaluate with mean_squared_error
print("mean squared error is", mean_squared_error(y_test_m_unscale, y_pred_m_unscale, multioutput="raw_values"))
Plot the actual and predicted values for the AvgRate KPI.
In [40]:
test_date = df[-test_num:].index
plot_predict_actual_values(test_date, y_pred_m_unscale[:,0], y_test_m_unscale[:,0], ylabel=multi_target_value[0])
Plot the actual and predicted values for the total bytes KPI.
In [41]:
plot_predict_actual_values(test_date, y_pred_m_unscale[:,1], y_test_m_unscale[:,1], ylabel=multi_target_value[1])