This is an overview and tutorial about ServerSim, a framework for the creation of discrete event simulation models to analyze the performance, throughput, and scalability of services deployed on computer servers.
Following the overview of ServerSim, we will provide an example and tutorial of its use for the comparison of two major service deployment patterns.
This document is a Jupyter notebook. See http://jupyter.org/ for more information on Jupyter notebooks.
ServerSim is a small framework based on SimPy, a well-known discrete-event simulation framework written in the Python language. The reader should have at least a cursory familiarity with Python and SimPy (https://simpy.readthedocs.io/en/latest/contents.html) in order to make the most of this document.
Python is well-suited to this kind of application due to its dynamic-language characteristics, which support rapid development, and the availability of powerful libraries relevant to this kind of work. In addition to SimPy, we will use portions of SciPy, a powerful set of libraries for efficient data analysis and visualization that includes Matplotlib, which will be used for plotting graphs in our tutorial.
ServerSim consists of several classes and utilities. The main classes are described below.
Represents a server -- physical, VM, or container -- with a predetermined computation capacity. A server can execute arbitrary service request types. The computation capacity of a server is represented in terms of a number of hardware threads and a total speed (computation units processed per unit of time). The total speed is equally apportioned among the hardware threads, giving the speed per hardware thread. A server also has a number of associated software threads (which must be no smaller than the number of hardware threads). Software threads are relevant for blocking computations only.
The simulations in this document assume non-blocking services, so the software threads will not be of consequence in the tutorial example.
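As a minimal sketch (the constructor arguments mirror those used in the simulation code later in this document, and assume Server is exported by the serversim package), a server with 10 hardware threads, 20 software threads, and a total speed of 20 compute units per time unit -- i.e., 2 compute units per time unit per hardware thread -- could be created as follows:
import simpy
from serversim import Server

env = simpy.Environment()
# 10 hardware threads, 20 software threads, speed of 20 compute units per time unit
server = Server(env, 10, 20, 20, "AppServer_0")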
A request for execution of computation units on one or more servers.
A service request submission is implemented as a SimPy Process.
A service request can be a composite of sub-requests. A service request that is part of a composite has an attribute that is a reference to its parent service request. In addition, such a composition is implemented through the gen attribute of the service request, which is a generator. That generator can yield service request submissions for other service requests.
By default, a service request is non-blocking, i.e., a thread is held on the target server only while the service request itself is executing; the thread is relinquished when the request finishes executing and it passes control to its sub-requests. However, blocking service requests can be modeled as well (see the Blkg class).
Base class of service requesters.
A service requester represents a service. In this framework, a service requester is a factory for service requests. "Deploying" a service on a server is modeled by having service requests produced by the service requester sent to the target server.
A service requester can be a composite of sub-requesters, thus representing a composite service.
Represents a set of identical users or clients that submit service requests.
Each user repeatedly submits service requests produced by service requesters randomly selected from the set of service requesters specified for the group.
This is the core service requester implementation that interacts with servers to utilize server resources.
All service requesters are either instances of this class or composites of such instances created using the various service requester combinators in this module.
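As a sketch of how the pieces fit together (condensed from the full simulation function presented later in this document, whose constructor signatures it assumes, with an illustrative computation-units function), a CoreSvcRequester takes an environment, a service name, a computation-units generator, and a load-balancing function that maps a service name to a target server; a UserGroup then drives requests against it:
import random
import simpy
from serversim import Server, CoreSvcRequester, UserGroup

env = simpy.Environment()
servers = [Server(env, 10, 20, 20, "AppServer_%s" % i) for i in range(2)]

def comp_units():
    # Compute units demanded by each svc_1 request
    return random.uniform(0.2, 3.8)

def ld_bal(svc_name):
    # Trivial load balancer: pick a random server for any service
    return random.choice(servers)

svc_1 = CoreSvcRequester(env, "svc_1", comp_units, ld_bal)

# 100 identical users submitting svc_1 requests, with think times uniformly
# distributed between 2 and 10 time units
grp = UserGroup(env, 100, "UserTypeX", [(svc_1, 1.0)], 2.0, 10.0,
                (0.5, 0.95, 0.99), [])
grp.activate_users()
env.run(until=200)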
Following are other service requester classes (subclasses of SvcRequester), in addition to CoreSvcRequester, that can be used to define more complex services, including blocking services, asynchronous fire-and-forget services, sequentially dependent services, parallel service calls, and service continuations. These additional classes are not used in the simulations in this document.
Wraps a service requester to produce asynchronous fire-and-forget service requests.
An asynchronous service request completes and returns immediately to the parent request, while the underlying (child) service request is scheduled for execution on its target server.
Wraps a service requester to produce blocking service requests.
A blocking service request will hold a software thread on the target server until the service request itself and all of its non-asynchronous sub-requests complete.
Combines a non-empty list of service requesters to yield a sequential composite service requester.
This service requester produces composite service requests.
A composite service request produced by this service requester consists of a service request from each of the provided service requesters. Each of the service requests is submitted in sequence, i.e., each service request is submitted when the previous one completes.
Combines a non-empty list of service requesters to yield a parallel composite service requester.
This service requester produces composite service requests.
A composite service request produced by this service requester consists of a service request from each of the provided service requesters. All of the service requests are submitted concurrently.
When the attribute cont is True, this represents multi-threaded execution of requests on the same server. Otherwise, each service request can execute on a different server.
Below we compare two major service deployment patterns by using discrete-event simulations. Ideally, the reader will have had some prior exposure to the Python language in order to follow all the details. However, the concepts and conclusions should be understandable to readers with a software architecture or engineering background even if they are not familiar with Python.
We assume an application made up of multiple multi-threaded services and consider two deployment patterns: cookie-cutter deployment, where every service is deployed on every server, and individualized deployment, where each service is deployed on its own dedicated set of servers.
In the simulations below, the application is made up of just two services, to simplify the model and the analysis, but without loss of generality in terms of the main conclusions.
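In the simulation function defined below, the deployment pattern is expressed entirely through the ranges of server indices assigned to each service. For example (the calls mirror those used in the scenarios later in this document; the variable names here are just illustrative), the cookie-cutter pattern deploys both services on all servers, while the individualized pattern gives each service its own disjoint subset of servers:
# Cookie-cutter: svc_1 and svc_2 are both deployed on servers 0-9
sc_cookie_cutter = simulate_deployment_scenario(
    num_users=720, weight1=2, weight2=1,
    server_range1=range(0, 10), server_range2=range(0, 10))

# Individualized: svc_1 on servers 0-7, svc_2 on servers 8-9
sc_individualized = simulate_deployment_scenario(
    num_users=720, weight1=2, weight2=1,
    server_range1=range(0, 8), server_range2=range(8, 10))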
The code used in these simulations should be compatible with both Python 2.7 and Python 3.x.
Python and the following Python packages need to be installed on your computer: SimPy, Matplotlib, and livestats, as well as the typing package for Python 2.7 (these are the packages imported by the code below).
The model in this document should be run from the parent directory of the serversim package directory, which contains the source files for the ServerSim framework.
Following is the core function used in the simulations. This function will be called with different arguments to simulate different scenarios.
This function sets up a simulation with the following givens: each server has 10 hardware threads, 20 software threads, and a speed of 20 compute units per time unit; svc_1 requests average 2.0 compute units and svc_2 requests average 1.0 compute units (each uniformly distributed within plus or minus 90% of its average); user think times are uniformly distributed between 2 and 10 time units; the response time quantiles tracked are 0.5, 0.95, and 0.99; and the simulation runs for 200 time units.
We import the required libraries, as well as the __future__ import for compatibility between Python 2.7 and Python 3.x.
In [1]:
# %load simulate_deployment_scenario.py
from __future__ import print_function
from typing import List, Tuple, Sequence
from collections import namedtuple
import random
import simpy
from serversim import *
def simulate_deployment_scenario(num_users, weight1, weight2, server_range1,
server_range2):
# type: (int, float, float, Sequence[int], Sequence[int]) -> Result
Result = namedtuple("Result", ["num_users", "weight1", "weight2", "server_range1",
"server_range2", "servers", "grp"])
def cug(mid, delta):
"""Computation units generator"""
def f():
return random.uniform(mid - delta, mid + delta)
return f
def ld_bal(svc_name):
"""Application server load-balancer."""
if svc_name == "svc_1":
svr = random.choice(servers1)
elif svc_name == "svc_2":
svr = random.choice(servers2)
else:
assert False, "Invalid service type."
return svr
simtime = 200
hw_threads = 10
sw_threads = 20
speed = 20
svc_1_comp_units = 2.0
svc_2_comp_units = 1.0
quantiles = (0.5, 0.95, 0.99)
env = simpy.Environment()
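    # The server pool is sized to cover the union of the two server index ranges;
    # each service is then assigned the sub-list of servers it is deployed on.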
n_servers = max(server_range1[-1] + 1, server_range2[-1] + 1)
servers = [Server(env, hw_threads, sw_threads, speed, "AppServer_%s" % i)
for i in range(n_servers)]
servers1 = [servers[i] for i in server_range1]
servers2 = [servers[i] for i in server_range2]
svc_1 = CoreSvcRequester(env, "svc_1", cug(svc_1_comp_units,
svc_1_comp_units*.9), ld_bal)
svc_2 = CoreSvcRequester(env, "svc_2", cug(svc_2_comp_units,
svc_2_comp_units*.9), ld_bal)
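    # Service requesters and their relative weights: each user randomly picks a
    # requester from this list, with probability proportional to its weight.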
weighted_txns = [(svc_1, weight1),
(svc_2, weight2)
]
min_think_time = 2.0 # .5 # 4
max_think_time = 10.0 # 1.5 # 20
svc_req_log = [] # type: List[Tuple[str, SvcRequest]]
grp = UserGroup(env, num_users, "UserTypeX", weighted_txns, min_think_time,
max_think_time, quantiles, svc_req_log)
grp.activate_users()
env.run(until=simtime)
return Result(num_users=num_users, weight1=weight1, weight2=weight2,
server_range1=server_range1, server_range2=server_range2,
servers=servers, grp=grp)
In [2]:
# %load print_results.py
from __future__ import print_function
from typing import Sequence, Any, IO
from serversim import *
def print_results(num_users=None, weight1=None, weight2=None, server_range1=None,
server_range2=None, servers=None, grp=None, fi=None):
# type: (int, float, float, Sequence[int], Sequence[int], Sequence[Server], UserGroup, IO[str]) -> None
if fi is None:
import sys
fi = sys.stdout
print("\n\n***** Start Simulation --", num_users, ",", weight1, ",", weight2, ", [", server_range1[0], ",", server_range1[-1] + 1,
") , [", server_range2[0], ",", server_range2[-1] + 1, ") *****", file=fi)
print("Simulation: num_users =", num_users, file=fi)
print("<< ServerExample >>\n", file=fi)
indent = " " * 4
print("\n" + "Servers:", file=fi)
for svr in servers:
print(indent*1 + "Server:", svr.name, file=fi)
print(indent * 2 + "max_concurrency =", svr.max_concurrency, file=fi)
print(indent * 2 + "num_threads =", svr.num_threads, file=fi)
print(indent*2 + "speed =", svr.speed, file=fi)
print(indent * 2 + "avg_process_time =", svr.avg_process_time, file=fi)
print(indent * 2 + "avg_hw_queue_time =", svr.avg_hw_queue_time, file=fi)
print(indent * 2 + "avg_thread_queue_time =", svr.avg_thread_queue_time, file=fi)
print(indent * 2 + "avg_service_time =", svr.avg_service_time, file=fi)
print(indent * 2 + "avg_hw_queue_length =", svr.avg_hw_queue_length, file=fi)
print(indent * 2 + "avg_thread_queue_length =", svr.avg_thread_queue_length, file=fi)
print(indent * 2 + "hw_queue_length =", svr.hw_queue_length, file=fi)
print(indent * 2 + "hw_in_process_count =", svr.hw_in_process_count, file=fi)
print(indent * 2 + "thread_queue_length =", svr.thread_queue_length, file=fi)
print(indent * 2 + "thread_in_use_count =", svr.thread_in_use_count, file=fi)
print(indent*2 + "utilization =", svr.utilization, file=fi)
print(indent*2 + "throughput =", svr.throughput, file=fi)
print(indent*1 + "Group:", grp.name, file=fi)
print(indent*2 + "num_users =", grp.num_users, file=fi)
print(indent*2 + "min_think_time =", grp.min_think_time, file=fi)
print(indent*2 + "max_think_time =", grp.max_think_time, file=fi)
print(indent * 2 + "responded_request_count =", grp.responded_request_count(None), file=fi)
print(indent * 2 + "unresponded_request_count =", grp.unresponded_request_count(None), file=fi)
print(indent * 2 + "avg_response_time =", grp.avg_response_time(), file=fi)
print(indent * 2 + "std_dev_response_time =", grp.std_dev_response_time(None), file=fi)
print(indent*2 + "throughput =", grp.throughput(None), file=fi)
for svc in grp.svcs:
print(indent*2 + svc.svc_name + ":", file=fi)
print(indent * 3 + "responded_request_count =", grp.responded_request_count(svc), file=fi)
print(indent * 3 + "unresponded_request_count =", grp.unresponded_request_count(svc), file=fi)
print(indent * 3 + "avg_response_time =", grp.avg_response_time(svc), file=fi)
print(indent * 3 + "std_dev_response_time =", grp.std_dev_response_time(svc), file=fi)
print(indent*3 + "throughput =", grp.throughput(svc), file=fi)
The following three functions handle mini-batching, plotting, and comparison of results.
In [3]:
# %load report_resp_times.py
from typing import TYPE_CHECKING, Sequence, Tuple
import functools as ft
from collections import OrderedDict
import matplotlib.pyplot as plt
from livestats import livestats
if TYPE_CHECKING:
from serversim import UserGroup
def minibatch_resp_times(time_resolution, grp):
# type: (float, UserGroup) -> Tuple[Sequence[float], Sequence[float], Sequence[float], Sequence[float], Sequence[float], Sequence[float]]
quantiles = [0.5, 0.95, 0.99]
xys = [(int(svc_req.time_dict["submitted"]/time_resolution),
svc_req.time_dict["completed"] - svc_req.time_dict["submitted"])
for (_, svc_req) in grp.svc_req_log
if svc_req.is_completed]
def ffold(map_, p):
x, y = p
if x not in map_:
map_[x] = livestats.LiveStats(quantiles)
map_[x].add(y)
return map_
xlvs = ft.reduce(ffold, xys, dict())
    xs = sorted(xlvs.keys())  # sorted list of time buckets (works on both Python 2.7 and 3.x)
counts = [xlvs[x].count for x in xs]
means = [xlvs[x].average for x in xs]
q_50 = [xlvs[x].quantiles()[0] for x in xs]
q_95 = [xlvs[x].quantiles()[1] for x in xs]
q_99 = [xlvs[x].quantiles()[2] for x in xs]
return xs, counts, means, q_50, q_95, q_99
def plot_counts_means_q95(quantiles1, quantiles2):
x = quantiles1[0] # should be same as quantiles2[0]
counts1 = quantiles1[1]
counts2 = quantiles2[1]
means1 = quantiles1[2]
means2 = quantiles2[2]
q1_95 = quantiles1[4]
q2_95 = quantiles2[4]
# Plot counts
plt.plot(x, counts1, color='b', label="Counts 1")
plt.plot(x, counts2, color='r', label="Counts 2")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel("Time buckets")
plt.ylabel("Throughput")
plt.show()
# Plot averages and 95th percentiles
plt.plot(x, means1, color='b', label="Means 1")
plt.plot(x, q1_95, color='c', label="95th Percentile 1")
plt.plot(x, means2, color='r', label="Means 2")
plt.plot(x, q2_95, color='m', label="95th Percentile 2")
# Hack to avoid duplicated labels (https://stackoverflow.com/questions/13588920/stop-matplotlib-repeating-labels-in-legend)
handles, labels = plt.gca().get_legend_handles_labels()
by_label = OrderedDict(zip(labels, handles))
plt.legend(by_label.values(), by_label.keys(), bbox_to_anchor=(1.05, 1),
loc=2, borderaxespad=0.)
plt.xlabel("Time buckets")
plt.ylabel("Response times")
plt.show()
def compare_scenarios(sc1, sc2):
grp1 = sc1.grp
grp2 = sc2.grp
quantiles1 = minibatch_resp_times(5, grp1)
quantiles2 = minibatch_resp_times(5, grp2)
plot_counts_means_q95(quantiles1, quantiles2)
In [4]:
random.seed(123456)
Several simulation scenarios are executed below. See the descriptions of the parameters and hard-coded values of the core simulation function above.
With 10 servers, weight1 = 2, and weight2 = 1, this configuration supports 720 users with average response times close to the minimum possible. How did we arrive at that number? For svc_1, the heavier of the two services, the minimum possible average response time is 1 time unit (2 average service compute units divided by the per-thread speed of 20 server compute units / 10 hardware threads = 2 compute units per time unit). One server can handle 10 concurrent svc_1 users without think time, or 60 concurrent svc_1 users with an average think time of 6 time units. Thus, 10 servers can handle 600 concurrent svc_1 users. Doing the math for both services and taking their respective probabilities into account, the number of users is 720. For full details, see the spreadsheet CapacityPlanning.xlsx. Of course, due to randomness, there will be queuing and the average response times will be greater than the minimum possible. With these numbers, the servers will be running hot, as there is no planned slack capacity.
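As a quick back-of-the-envelope check of the 720 figure (a sketch that, like the reasoning above, approximates each user's request rate as one request per average think time of 6 time units):
num_servers = 10
server_speed = 20.0                  # compute units per time unit per server
avg_think_time = (2.0 + 10.0) / 2    # uniform(2, 10) -> 6 time units

# (weight, average compute units per request) for each service
svcs = {"svc_1": (2.0, 2.0), "svc_2": (1.0, 1.0)}
total_weight = sum(w for w, _ in svcs.values())
weighted_units = sum(w * u for w, u in svcs.values())

# Total capacity of the server pool, in compute units per time unit
capacity = num_servers * server_speed

# Each user submits roughly one request per think time, demanding on average
# weighted_units / total_weight compute units per request.
max_users = capacity * avg_think_time * total_weight / weighted_units
print(max_users)   # -> 720.0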
In [5]:
sc1 = simulate_deployment_scenario(num_users=720, weight1=2, weight2=1,
server_range1=range(0, 10), server_range2=range(0, 10))
print_results(**sc1._asdict())  # _asdict() works for namedtuples on both Python 2.7 and 3.x
In [6]:
rand_state = random.getstate()
sc1 = simulate_deployment_scenario(num_users=720, weight1=2, weight2=1,
server_range1=range(0, 10), server_range2=range(0, 10))
random.setstate(rand_state)
sc2 = simulate_deployment_scenario(num_users=720, weight1=2, weight2=1,
server_range1=range(0, 8), server_range2=range(8, 10))
compare_scenarios(sc1, sc2)
Repeating the above comparison to illustrate the variability of the results.
In [7]:
rand_state = random.getstate()
sc1 = simulate_deployment_scenario(num_users=720, weight1=2, weight2=1,
server_range1=range(0, 10), server_range2=range(0, 10))
random.setstate(rand_state)
sc2 = simulate_deployment_scenario(num_users=720, weight1=2, weight2=1,
server_range1=range(0, 8), server_range2=range(8, 10))
compare_scenarios(sc1, sc2)
Conclusions: The results of the two deployment strategies are similar in terms of throughput, mean response times, and 95th percentile response times. This is as would be expected, since the capacities allocated under the individualized deployment strategy are proportional to the respective service loads.
In [8]:
rand_state = random.getstate()
sc1 = simulate_deployment_scenario(num_users=720, weight1=5, weight2=1,
server_range1=range(0, 10), server_range2=range(0, 10))
random.setstate(rand_state)
sc2 = simulate_deployment_scenario(num_users=720, weight1=5, weight2=1,
server_range1=range(0, 8), server_range2=range(8, 10))
compare_scenarios(sc1, sc2)
Conclusions: The cookie-cutter deployment strategy was able to absorb the change in load mix while the individualized strategy was not, showing visibly lower throughput and higher mean and 95th percentile response times.
In [9]:
rand_state = random.getstate()
sc1 = simulate_deployment_scenario(num_users=720, weight1=1, weight2=1,
server_range1=range(0, 10), server_range2=range(0, 10))
random.setstate(rand_state)
sc2 = simulate_deployment_scenario(num_users=720, weight1=1, weight2=1,
server_range1=range(0, 8), server_range2=range(8, 10))
compare_scenarios(sc1, sc2)
Conclusions: Again, the cookie-cutter deployment strategy was able to absorb the change in load mix while the individualized strategy was not, showing visibly lower throughput and higher mean and 95th percentile response times. Notice that, due to the changed load mix, the total load was lower than before; with the same number of servers, the cookie-cutter configuration had excess capacity overall, while the individualized configuration had excess capacity for svc_1 and insufficient capacity for svc_2.
We now continue with the weights used in Simulation 3, but adjust server capacity to account for the lower aggregate load and different load mix.
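The adjusted capacity can be derived with the same back-of-the-envelope arithmetic as before (a sketch, again approximating each user's request rate as one request per average think time):
from __future__ import print_function

num_users = 720
server_speed = 20.0
avg_think_time = 6.0

# (weight, average compute units per request); weights are now 1:1
svcs = {"svc_1": (1.0, 2.0), "svc_2": (1.0, 1.0)}
total_weight = sum(w for w, _ in svcs.values())

for name in sorted(svcs):
    weight, units = svcs[name]
    demand = num_users * weight * units / total_weight / avg_think_time
    print(name, "->", demand / server_speed, "servers")
# svc_1 -> 6.0 servers, svc_2 -> 3.0 servers, i.e., 9 servers in total
Scenario 2b below matches this ideal 6 + 3 split, while Scenario 2a's 7 + 2 split does not.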
Below we have three scenarios: Scenario 1 uses the cookie-cutter strategy with 9 servers shared by both services; Scenario 2a uses the individualized strategy with 7 servers for svc_1 and 2 for svc_2; and Scenario 2b uses the individualized strategy with 6 servers for svc_1 and 3 for svc_2.
Run the three scenarios:
In [10]:
rand_state = random.getstate()
sc1 = simulate_deployment_scenario(num_users=720, weight1=1, weight2=1,
server_range1=range(0, 9), server_range2=range(0, 9))
random.setstate(rand_state)
sc2a = simulate_deployment_scenario(num_users=720, weight1=1, weight2=1,
server_range1=range(0, 7), server_range2=range(7, 9))
random.setstate(rand_state)
sc2b = simulate_deployment_scenario(num_users=720, weight1=1, weight2=1,
server_range1=range(0, 6), server_range2=range(6, 9))
Compare the results of scenarios 1 and 2a:
In [11]:
compare_scenarios(sc1, sc2a)
Compare the results of scenarios 1 and 2b:
In [12]:
compare_scenarios(sc1, sc2b)
Conclusions: Scenario 1 performs significantly better than Scenario 2a and comparably to Scenario 2b. This simulation shows again that the cookie-cutter strategy is comparable in performance and throughput to a tuned individualized configuration, and beats hands down an individualized configuration that is not perfectly tuned for the load mix.
In [13]:
# Number of users over time: (simulation time, number of users) pairs
users_curve = [(0, 900), (50, 540), (100, 900), (150, 540)]
In [14]:
rand_state = random.getstate()
sc1 = simulate_deployment_scenario(num_users=users_curve, weight1=2, weight2=1,
server_range1=range(0, 10), server_range2=range(0, 10))
random.setstate(rand_state)
sc2 = simulate_deployment_scenario(num_users=users_curve, weight1=2, weight2=1,
server_range1=range(0, 8), server_range2=range(8, 10))
compare_scenarios(sc1, sc2)
Conclusions: The cookie-cutter and individualized strategies produced similar results.
We now run a simulation similar to Simulation 4, with the difference that the number of users varies over time. This combines load variability over time as well as a change in load mix. As in Simulation 4, we adjust server capacity to account for the lower aggregate load and different load mix.
Below we have three scenarios: Scenario 1 uses the cookie-cutter strategy with 9 servers shared by both services; Scenario 2a uses the individualized strategy with 7 servers for svc_1 and 2 for svc_2; and Scenario 2b uses the individualized strategy with 6 servers for svc_1 and 3 for svc_2.
Run the three scenarios:
In [15]:
rand_state = random.getstate()
sc1 = simulate_deployment_scenario(num_users=users_curve, weight1=1, weight2=1,
server_range1=range(0, 9), server_range2=range(0, 9))
random.setstate(rand_state)
sc2a = simulate_deployment_scenario(num_users=users_curve, weight1=1, weight2=1,
server_range1=range(0, 7), server_range2=range(7, 9))
random.setstate(rand_state)
sc2b = simulate_deployment_scenario(num_users=users_curve, weight1=1, weight2=1,
server_range1=range(0, 6), server_range2=range(6, 9))
Compare the results of scenarios 1 and 2a:
In [16]:
compare_scenarios(sc1, sc2a)
Compare the results of scenarios 1 and 2b:
In [17]:
compare_scenarios(sc1, sc2b)
Conclusions: Scenario 1 performs significantly better than Scenario 2a and comparably to Scenario 2b. This simulation shows again that the cookie-cutter strategy is comparable in performance and throughput to a tuned individualized configuration, and beats an individualized configuration that is not perfectly tuned for the load mix.
This final simulation is similar to Simulation 1, with the difference that the number of users is 864 instead of 720. In this scenario, the total number of servers required for best capacity utilization can be calculated to be 12 (see CapacityPlanning.xlsx). Under the individualized deployment strategy, the ideal numbers of servers allocated to svc_1 and svc_2 would be 9.6 and 2.4, respectively. Since the number of servers needs to be an integer, we will run simulations with server allocations to svc_1 and svc_2, respectively, of 10 and 2, 9 and 3, and 10 and 3.
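The 12-server total and the 9.6 / 2.4 split quoted above can be reproduced with the same back-of-the-envelope arithmetic (a sketch):
from __future__ import print_function

num_users = 864
server_speed = 20.0
avg_think_time = 6.0

# (weight, average compute units per request); weights are 2:1 as in Simulation 1
svcs = {"svc_1": (2.0, 2.0), "svc_2": (1.0, 1.0)}
total_weight = sum(w for w, _ in svcs.values())

for name in sorted(svcs):
    weight, units = svcs[name]
    demand = num_users * weight * units / total_weight / avg_think_time
    print(name, "->", demand / server_speed, "servers")
# svc_1 -> 9.6 servers, svc_2 -> 2.4 servers, i.e., 12 servers in total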
Thus, we have five scenarios: Scenario 1a uses the cookie-cutter strategy with 12 servers; Scenario 2a1 uses the individualized strategy with 9 servers for svc_1 and 3 for svc_2; Scenario 2a2 uses the individualized strategy with 10 servers for svc_1 and 2 for svc_2; Scenario 1b uses the cookie-cutter strategy with 13 servers; and Scenario 2b uses the individualized strategy with 10 servers for svc_1 and 3 for svc_2 (13 servers in total).
Run the scenarios:
In [18]:
rand_state = random.getstate()
sc1a = simulate_deployment_scenario(num_users=864, weight1=2, weight2=1,
server_range1=range(0, 12), server_range2=range(0, 12))
random.setstate(rand_state)
sc2a1 = simulate_deployment_scenario(num_users=864, weight1=2, weight2=1,
server_range1=range(0, 9), server_range2=range(9, 12))
random.setstate(rand_state)
sc2a2 = simulate_deployment_scenario(num_users=864, weight1=2, weight2=1,
server_range1=range(0, 10), server_range2=range(10, 12))
random.setstate(rand_state)
sc1b = simulate_deployment_scenario(num_users=864, weight1=2, weight2=1,
server_range1=range(0, 13), server_range2=range(0, 13))
random.setstate(rand_state)
sc2b = simulate_deployment_scenario(num_users=864, weight1=2, weight2=1,
server_range1=range(0, 10), server_range2=range(10, 13))
Compare the results of scenarios 1a and 2a1:
In [19]:
compare_scenarios(sc1a, sc2a1)
Compare the results of scenarios 1a and 2a2:
In [20]:
compare_scenarios(sc1a, sc2a2)
Compare the results of scenarios 1b and 2b:
In [21]:
compare_scenarios(sc1b, sc2b)
Conclusions: Scenario 1a has comparable throughput but somewhat better response times than Scenario 2a1. Scenario 1a has somewhat better throughput and response times than Scenario 2a2. Scenario 1b has comparable throughput and a bit less extreme response times than Scenario 2b. In all three comparisons, the cookie-cutter strategy performs better than or comparably to the individualized strategy.
The various simulations show consistently that the cookie-cutter strategy is comparable in performance and throughput (and therefore hardware utilization) to a tuned individualized configuration, and beats an individualized configuration that is not well-tuned for the load mix. Cookie-cutter thus proves to be a more robust and stable deployment strategy in many realistic situations, in the face of likely load mix fluctuations, mismatches between forecast average load mixes and actual average load mixes, and mismatches between forecast load mixes and allocated server capacities. However, although not highlighted on the simulation graphs presented, it is a fact (that can be observed in the simulation logs) that response times for svc_2 are better under a well-tuned individualized configuration because then svc_2 requests don't have to share a queue with longer-running svc_1 requests. When that's an important consideration, an individualized deployment strategy could be a more appropriate choice.