In this tutorial, we show how to transform time-series data in the following ways:

- split a time series with a lot of data points into multiple segments, and
- time shift those segments so that they all have the same timestamps.

These transformations let you easily compare the data from the last hour/day/week against previous intervals. This helps you understand whether your system's current behavior matches its past behavior.
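As a quick preview of the segmentation step, here is a minimal sketch on synthetic data (the hourly index and values are illustrative, not from this dataset):

import pandas
# Two days of synthetic hourly data.
series = pandas.DataFrame({'cpu': range(48)},
                          index=pandas.date_range('2016-10-03', periods=48, freq='H'))
# Split into daily segments of 24 hourly points each.
segments = [series.iloc[24*i: 24*(i+1)] for i in range(2)]
print('%d segments of %d points' % (len(segments), len(segments[0])))  # 2 segments of 24 points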
Note: This tutorial reads in metric data from the Monitoring API or a Google Cloud Storage bucket:

- If 'common_prefix' is set, the data is read from the Monitoring API.
- If 'common_prefix' is not set, the data is loaded from a shared Cloud Storage bucket. See the Storage API documentation to learn more.
In [1]:
from datalab.stackdriver import monitoring as gcm
# Uncomment and edit the following line to set the project to query.
# set_datalab_project_id('my-project-id')
In [2]:
import collections
# Initialize the query for CPU utilization over the last week, and read in its metadata.
query_cpu = gcm.Query('compute.googleapis.com/instance/cpu/utilization', hours=7*24)
cpu_metadata = query_cpu.metadata()
# Count the occurrences of each prefix, and display the top 5.
instance_prefix_counts = collections.Counter(
    timeseries.metric.labels['instance_name'].rsplit('-', 1)[0]
    for timeseries in cpu_metadata)
instance_prefix_counts.most_common(5)
Out[2]:
In this cell, you can select an instance name prefix to filter on. If you do not set this variable, the data is read from a Cloud Storage bucket instead.
The previous cell lists the most frequent prefixes; it is recommended that you pick one of the most common ones, so that the filtered query covers several instances.
In [3]:
# Set this variable to read data from your own project.
common_prefix = None  # 'my-instance-prefix'
if common_prefix is None:
    print('No prefix specified. The data will be read from a Cloud Storage bucket.')
else:
    print('You selected the prefix: "%s"' % (common_prefix,))
Based on the value of 'common_prefix' in the previous cell, the time series data is loaded either from the Monitoring API or from a shared Cloud Storage bucket.
In both cases, we load the time series of the CPU utilization metric over the last week, aggregated to hourly intervals per zone.
In [4]:
import StringIO
import pandas
import datalab.storage as storage
if common_prefix is None:
    print('Reading in data from a Cloud Storage bucket')
    # Initialize the bucket name, and item key.
    bucket_name = 'cloud-datalab-samples'
    per_zone_data = 'stackdriver-monitoring/timeseries/per-zone-weekly-20161010.csv'
    # Load the CSV from the bucket, and initialize the dataframe using it.
    per_zone_data_item = storage.Bucket(bucket_name).item(per_zone_data)
    per_zone_data_string = StringIO.StringIO(per_zone_data_item.read_from())
    per_zone_cpu_data = pandas.read_csv(per_zone_data_string, index_col=0, parse_dates=True)
else:
    print('Reading in data from the Monitoring API')
    # Filter the query to instances with the specified prefix.
    query_cpu = query_cpu.select_metrics(instance_name_prefix=common_prefix)
    # Aggregate to hourly intervals per zone.
    query_cpu = query_cpu.align(gcm.Aligner.ALIGN_MEAN, hours=1)
    query_cpu = query_cpu.reduce(gcm.Reducer.REDUCE_MEAN, 'resource.zone')
    # Get the time series data as a dataframe, with a single-level header.
    per_zone_cpu_data = query_cpu.as_dataframe(label='zone')
per_zone_cpu_data.tail(5)
Out[4]:
In [5]:
# Extract the number of days in the dataframe.
num_days = len(per_zone_cpu_data.index) // 24
# Split the big dataframe into daily dataframes.
daily_dataframes = [per_zone_cpu_data.iloc[24*i: 24*(i+1)]
                    for i in xrange(num_days)]
# Reverse the list to have today's data in the first index.
daily_dataframes.reverse()
# Display the last five rows from today's data.
daily_dataframes[0].tail(5)
Out[5]:
In [6]:
TODAY = 'Today'

# Helper function to make a readable day name based on the offset from today.
def make_day_name(offset):
    if offset == 0:
        return TODAY
    elif offset == 1:
        return 'Yesterday'
    return '%d days ago' % (offset,)
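As a quick check of the helper (this call is illustrative, not part of the original notebook):

print([make_day_name(i) for i in range(3)])  # ['Today', 'Yesterday', '2 days ago']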
The pandas method tshift lets you shift a dataframe's index by a specified offset. We use this to shift the index of every earlier day so that it matches the timestamps of the latest day.
The data for each zone is inserted into a separate dataframe, where the rows are timestamps and the columns are specific days.
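To see what tshift does in isolation, here is a minimal sketch on synthetic hourly data (the index, values, and one-day offset are illustrative):

import pandas
# One day of synthetic hourly data.
monday = pandas.DataFrame({'cpu': range(24)},
                          index=pandas.date_range('2016-10-03', periods=24, freq='H'))
# Shift the index forward by exactly one day, onto the next day's timestamps.
shifted = monday.tshift(freq=pandas.Timedelta(days=1))
print(shifted.index[0])  # 2016-10-04 00:00:00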
In [7]:
# Extract the zone names.
all_zones = per_zone_cpu_data.columns.tolist()
# Use the last day's timestamps as the index, and initialize a dataframe per zone.
last_day_index = daily_dataframes[0].index
zone_to_shifted_df = {zone: pandas.DataFrame([], index=last_day_index)
                      for zone in all_zones}
for i, dataframe in enumerate(daily_dataframes):
    # Shift the dataframe to line up with the start of the last day.
    dataframe = dataframe.tshift(freq=last_day_index[0] - dataframe.index[0])
    current_day_name = make_day_name(i)
    # Insert each zone's shifted data as a new column in that zone's dataframe.
    for zone in all_zones:
        zone_to_shifted_df[zone][current_day_name] = dataframe[zone]
# Display the first five rows from the first zone.
zone_to_shifted_df[all_zones[0]].head(5)
Out[7]:
In [8]:
for zone, dataframe in zone_to_shifted_df.iteritems():
    # Plot each day as a separate line, with the legend outside the axes.
    dataframe.plot(title=zone).legend(loc="upper left", bbox_to_anchor=(1,1))
In [9]:
for zone, dataframe in zone_to_shifted_df.iteritems():
    # Initialize the dataframe by extracting the column with data for today.
    compare_to_avg_df = dataframe.loc[:, [TODAY]]
    # Add a column with the weekly average.
    compare_to_avg_df['Weekly avg.'] = dataframe.mean(axis=1)
    # Plot this dataframe.
    compare_to_avg_df.plot(title=zone).legend(loc="upper left", bbox_to_anchor=(1,1))
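If you prefer a numeric summary over a plot, one possibility (a minimal sketch, not part of the original tutorial) is to report how far today strays from the weekly average in each zone:

for zone, dataframe in zone_to_shifted_df.iteritems():
    # Largest absolute gap between today's values and the weekly average.
    deviation = (dataframe[TODAY] - dataframe.mean(axis=1)).abs()
    print('%s: max deviation from the weekly average = %.3f' % (zone, deviation.max()))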