SARcasm

A Jupyter notebook for helping [me] make sense of sysstat sar data.

Copyright (c) 2017 Brendon Caligari, London, UK

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.

Strategy:

  • Export one, or possibly more, sysstat .sa files as XML
  • Import the sar data from XML into pandas dataframes
  • Provide reference code for manipulation and visualisation of the various metrics

Tested and assumed to work with:

  • SLES 11 SP[34]
  • Ubuntu 16.04

Usage

As sysstat evolved, so did the file format used to store the sar data, and unfortunately there is no guarantee that a given version of sadf can read .sa files created with an earlier version of sysstat. Thankfully, both the data file format version and the sysstat version used can be determined from the file header. sadf can export .sa files into various formats. XML output has been available consistently since [at least] sysstat version 8.1.5 and proved to be the most unambiguous and complete format for this script's objectives.
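
For orientation, here is a minimal standalone sketch of the header check that SarData.sa_get_version performs below ("sa05" is an example filename). The magic and format version are stored little-endian, so the bytes are reversed for display:

    with open("sa05", mode="rb") as f:
        magic, version = f.read(2), f.read(2)       # first four bytes of the header
    print("magic:  {:02x}{:02x}".format(magic[1], magic[0]))      # expect d596
    print("format: {:02x}{:02x}".format(version[1], version[0]))  # e.g. 2173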

Sysstat binaries

For every .sa file format version, an sadf binary capable of exporting its contents to XML is required. This mapping from file format version to binary is maintained in the sa_exporters class variable (a dict) of the SarData class.
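
Adding support for another format version is just a matter of extending that dict. A hypothetical example (the version key and binary path below are made up):

    SarData.sa_exporters["2171"] = [
        "./sadf.binaries/example-distro/usr/bin/sadf", "-x", "-t", "--", "-A"]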

Files to analyse

Which .sa files to analyse, and the directory where they are found, are set in the sa_files list and the sa_directory variable respectively in the Importing sa files section. For a meaningful time series it is sensible to use consecutive daily sar files from the same host.
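
As an alternative to listing the files by hand, a glob can pick up everything that looks like a daily sa file (a sketch; adjust the pattern to taste):

    import glob
    import os

    sa_directory = './z1'
    sa_files = sorted(os.path.basename(p)
                      for p in glob.glob(os.path.join(sa_directory, 'sa??')))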

Analysing the data

The section Sanity checks outputs metadata collected from the input files and describes the data frames used to store the collected information. The Plots section provides templates for frequently required plots. The idea, however, is for those plots to be modified or added to as required.

System requirements

  • Python3
  • Jupyter
  • pandas
  • Should be fine with Anaconda [Python3]

KNOWN ISSUES

  • This is a bodged-together ugly hack, not a proper application
  • We are rudely abusing XML both in principle and practice
  • TODO compile a table of known file formats and compatible sysstat versions
  • TODO include link to sysstat versions
  • TODO include links to various sysstat binary downloads
  • Heavy refactoring is required

In [1]:
sa_directory = './z1'
sa_files = ['sa05', 'sa06', 'sa07', 'sa08', 'sa09', 'sa10', 'sa11', 'sa12', 'sa13']

In [2]:
import xml.etree.ElementTree as ET
import pandas as pd
import subprocess
from datetime import datetime
import matplotlib
%matplotlib inline

In [3]:
class SarData(object):
    """Class for importing sar data"""
    
    sa_file_magic = "96d5"
    
    sa_exporters = {
        "2170": ["./sadf.binaries/sles11sp3/usr/bin/sadf", "-x", "-t", "--", "-A"],
        "2173": ["/usr/bin/sadf", "-x", "-t", "--", "-A"]
    }

    @classmethod
    def sa_get_version(cls, filename):
        """Checks an sa file for sysstat sa magic and returns file format version"""
        with open(filename, mode="rb") as f:
            magic = f.read(2)
            if magic != bytes.fromhex(cls.sa_file_magic):
                raise TypeError("{} does not start with sa file magic. Got 0x{:02x}{:02x}".format(
                    filename, magic[1], magic[0]))
            format_version = f.read(2)
            return "{:02x}{:02x}".format(format_version[1], format_version[0])

    @classmethod
    def sa_to_xml(cls, filename):
        """Open an sa file and export it to XML"""
        sa_ver = cls.sa_get_version(filename)
        if sa_ver not in cls.sa_exporters:
            raise NotImplementedError("No exporter for {} format version {}".format(filename, sa_ver))
        return subprocess.check_output(cls.sa_exporters[sa_ver] + [filename])
    
    def __init__(self):
        self.hostname = None            # hostname within sa file
        self.aggregate0 = None          # aggregate of what we pull out of the xml
        self.aggregate1 = None
        self.hostmeta = None
        self._raw_aggregate0 = list()
        self._raw_metadata = list()
        
    def import_file(self, filename):
        """Import an sa file"""
        # XML output from sadf seems to have developed 'evolutionarily' with namespaces
        # thrown in at some point for added inconvenience.  Here we strip them out.
        raw_xml = self.sa_to_xml(filename)      # Read in the xml output of sadf
        xml_tree = ET.fromstring(raw_xml)       # Convert to ET elements
        __class__._strip_xml_ns(xml_tree)       # Remove XML namespace crap inplace
        if xml_tree.tag != "sysstat":
            raise TypeError("Expected 'sysstat' but found a root of {}".format(xml_tree.tag))
        for level1 in xml_tree:
            if level1.tag == 'sysdata-version':
                pass
            elif level1.tag == "host":
                tmp_hostmeta = {'filename': filename}
                if "nodename" in level1.attrib:
                    tmp_hostmeta['nodename'] = level1.attrib["nodename"]
                for host_child in level1:
                    if host_child.tag == "statistics":
                        for timestamp in host_child:
                            if timestamp.tag != "timestamp":
                                print("  Unexpected statistic tag: {}".timestamp.tag)
                                continue
                            datum_time = datetime.strptime(timestamp.attrib["date"] +
                                                           "T" + timestamp.attrib["time"],
                                                           "%Y-%m-%dT%H:%M:%S")
                            tmp_datum0 = {'timestamp': datum_time}
                            for metric in timestamp:
                                # Simple key value statistics for regular polls will all go
                                # into a single dict 'aggregate0'
                                aggregate0_parser = self.get_aggregate0_parser(metric.tag)
                                if aggregate0_parser:
                                    tmp_datum0.update(aggregate0_parser(metric))
                            self._raw_aggregate0.append(tmp_datum0)
                    else:
                        hostmeta_parser = self.get_hostmeta_parser(host_child.tag)
                        if hostmeta_parser:
                            tmp_hostmeta[host_child.tag] = hostmeta_parser(host_child)
                        else:
                            print("Unexpected host tag: {}".format(host_child.tag))
                self._raw_metadata.append(tmp_hostmeta)
            else:
                print("Unknown level 1: {}".format(child_tag))
        pass
        
    def dicts_to_dataframes(self):
        """Convert the various temporary lists of dicts to Data Frames"""
        self.aggregate0 = pd.DataFrame.from_dict(self._raw_aggregate0)
        self.hostmeta = pd.DataFrame.from_dict(self._raw_metadata)
    
    def fix_numeric_columns(self):
        """Convert necessary Data Frame columns from str to numeric"""
        for sa_column in self.aggregate0.columns:
            if sa_column == 'timestamp':
                continue
            self.aggregate0[sa_column] = pd.to_numeric(self.aggregate0[sa_column],
                                                       errors='coerce')

    @classmethod
    def _strip_xml_ns(cls, element):
        element.tag = element.tag.split('}')[-1]
        for child in element:
            cls._strip_xml_ns(child)

    @classmethod
    def get_hostmeta_parser(cls, meta_tag):
        """Returns a parser for elements under 'host'"""
        ## we can comment out any problematic ones at runtime
        meta_parsers = {
            'sysname': cls._parse_meta_default,
            'release': cls._parse_meta_default,
            'comments': cls._parse_meta_default,
            'restarts': cls._parse_meta_default,
            'machine': cls._parse_meta_default,
            'number-of-cpus': cls._parse_meta_default,
            'file-date': cls._parse_meta_default,
            'file-utc-time': cls._parse_meta_default
        }
        if meta_tag in meta_parsers:
            return meta_parsers[meta_tag]
        return None
        
    @classmethod
    def get_aggregate0_parser(cls, metric_tag):
        """Returns a parser for elements under 'host->statistics'"""
        ## we can comment out any problematic ones at runtime
        metric_parsers = {
            'queue': cls._parse_absorb_level,
            'memory': cls._parse_absorb_level,
            'process-and-context-switch': cls._parse_absorb_level,
            'hugepages': cls._parse_absorb_level,
            'paging': cls._parse_absorb_level,
            'io': cls._parse_absorb_level,
            'swap-pages': cls._parse_absorb_level,
            'kernel': cls._parse_absorb_level
        }
        if metric_tag in metric_parsers:
            return metric_parsers[metric_tag]
        return None

    @staticmethod
    def _parse_meta_default(element):
        """returns text field of a sar metadata element"""
        return element.text.strip()
    
    @staticmethod
    def _parse_absorb_level(element):
        """returns a dict of flattened attributes and element tags with some cleanup"""
        tmp_dict = dict()
        for child in element:
            if child.text:
                tmp_dict[child.tag] = child.text
            else:
                tmp_dict.update(child.attrib.copy())
        tmp_dict.update(element.attrib.copy())
        for undesired in ['per', 'unit']:
            if undesired in tmp_dict:
                tmp_dict.pop(undesired)
        return tmp_dict
    
    def plot_simple_aggregate0(self, y_vars):
        """simple timeseries from self.aggregate0 columns"""
        try:
            self.aggregate0.plot(x='timestamp',
                                 y=y_vars,
                                 style='-',
                                 figsize=(15, 10))
        except KeyError as err:
            print("KeyError: {}".format(err))

Importing sa files

Here we extract the various entries within the sar file and populate appropriate data frames. The process is deliberately iterative and verbose to expose the various fields and to make on-the-fly modification, or manual reproduction of the steps, easier. It does, however, need heavy refactoring regardless.


In [4]:
sa = SarData()

for sa_file in map(lambda sa_file: "{}/{}".format(sa_directory, sa_file), sa_files):
    sa.import_file(sa_file)

sa.dicts_to_dataframes()
sa.fix_numeric_columns()
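
Should the aggregated metrics be needed outside the notebook, pandas can dump them to CSV ('sar_metrics.csv' is an arbitrary example filename):

    sa.aggregate0.to_csv('sar_metrics.csv', index=False)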

Sanity checks


In [5]:
sa.hostmeta


Out[5]:
    file-date file-utc-time   filename machine nodename number-of-cpus            release restarts sysname
0  2017-10-05      23:00:01  ./z1/sa05  x86_64  elysium              8  4.10.0-35-generic            Linux
1  2017-10-06      23:00:01  ./z1/sa06  x86_64  elysium              8  4.10.0-35-generic            Linux
2  2017-10-07      23:00:01  ./z1/sa07  x86_64  elysium              8  4.10.0-35-generic            Linux
3  2017-10-08      23:00:01  ./z1/sa08  x86_64  elysium              8  4.10.0-35-generic            Linux
4  2017-10-09      23:00:01  ./z1/sa09  x86_64  elysium              8  4.10.0-35-generic            Linux
5  2017-10-10      23:00:01  ./z1/sa10  x86_64  elysium              8  4.10.0-35-generic            Linux
6  2017-10-11      23:00:01  ./z1/sa11  x86_64  elysium              8  4.10.0-35-generic            Linux
7  2017-10-12      23:00:01  ./z1/sa12  x86_64  elysium              8  4.10.0-35-generic            Linux
8  2017-10-13      23:00:01  ./z1/sa13  x86_64  elysium              8  4.10.0-35-generic            Linux

In [6]:
sa.aggregate0.dtypes


Out[6]:
active                      int64
anonpg                      int64
blocked                     int64
bread                     float64
buffers                     int64
bufpg                     float64
bwrtn                     float64
cached                      int64
campg                     float64
commit                      int64
commit-percent            float64
cswch                     float64
dentunusd                   int64
dirty                       int64
fault                     float64
file-nr                     int64
frmpg                     float64
hugfree                     int64
hugused                     int64
hugused-percent           float64
inactive                    int64
inode-nr                    int64
kstack                      int64
ldavg-1                   float64
ldavg-15                  float64
ldavg-5                   float64
majflt                    float64
memfree                     int64
memused                     int64
memused-percent           float64
pgfree                    float64
pgpgin                    float64
pgpgout                   float64
pgscand                   float64
pgscank                   float64
pgsteal                   float64
pgtbl                       int64
plist-sz                    int64
proc                      float64
pswpin                    float64
pswpout                   float64
pty-nr                      int64
rtps                      float64
runq-sz                     int64
slab                        int64
swpcad                      int64
swpcad-percent            float64
swpfree                     int64
swpused                     int64
swpused-percent           float64
timestamp          datetime64[ns]
tps                       float64
vmeff-percent             float64
vmused                      int64
wtps                      float64
dtype: object
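
Because fix_numeric_columns converts with errors='coerce', any values that failed to parse become NaN. A quick per-column count of missing values is a useful extra sanity check (a sketch):

    sa.aggregate0.isnull().sum().sort_values(ascending=False).head()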

Plots

Queue length and load averages

  • ldavg-1 - System load average over past 1 minute
  • ldavg-5 - System load average over past 5 minutes
  • ldavg-15 - System load average over past 15 minutes

From uptime(1):

   System load averages is the average number of processes that are either
   in a runnable or uninterruptable state.  A process in a runnable state
   is either using the CPU or waiting to use the CPU.  A process in
   uninterruptable state is waiting for some I/O access, eg waiting for
   disk.  The averages are taken over the three time intervals.  Load
   averages are not normalized for the number of CPUs in a system, so a
   load average of 1 means a single CPU system is loaded all the time
   while on a 4 CPU system it means it was idle 75% of the time.

In [7]:
sa.plot_simple_aggregate0(['ldavg-1', 'ldavg-5', 'ldavg-15'])
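
As the man page notes, load averages are not normalized for CPU count. A quick sketch to eyeball per-CPU load, assuming the CPU count is constant across the input files:

    # number-of-cpus is stored as a string in hostmeta; take the first file's value
    n_cpus = int(sa.hostmeta['number-of-cpus'].iloc[0])
    ld = sa.aggregate0.set_index('timestamp')[['ldavg-1', 'ldavg-5', 'ldavg-15']]
    (ld / n_cpus).plot(style='-', figsize=(15, 10))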


  • runq-sz - number of tasks waiting to run
  • plist-sz - number of tasks in process list

In [8]:
sa.plot_simple_aggregate0(['runq-sz', 'plist-sz'])


Memory utilisation statistics

  • memfree - kB of free memory available
  • memused - kB of memory used
  • buffers - kB of memory used as kernel buffers
  • cached - kB of memory used for page cache and slabs
  • swpused - kB of swap space used

In [9]:
sa.plot_simple_aggregate0(['memfree', 'memused', 'buffers', 'cached', 'swpused'])
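
The memory columns are in kilobytes; scaling to gigabytes can make the y-axis easier to read. A sketch:

    mem = sa.aggregate0.set_index('timestamp')[
        ['memfree', 'memused', 'buffers', 'cached', 'swpused']]
    (mem / (1024 * 1024)).plot(style='-', figsize=(15, 10))  # kB -> GB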


  • memused-percent - percentage memory used
  • swpused-percent - percentage swap used

In [10]:
sa.plot_simple_aggregate0(['memused-percent', 'swpused-percent'])


Swapping statistics

  • pswpin - pages swapped in per second
  • pswpout - pages swapped out per second

In [11]:
sa.plot_simple_aggregate0(['pswpin', 'pswpout'])


Task creation and system switching

  • proc - tasks created per second

In [12]:
sa.plot_simple_aggregate0(['proc'])


  • cswch - context switches per second

In [13]:
sa.plot_simple_aggregate0(['cswch'])


Huge pages utilisation

  • hugused - kB of hugepages memory in use
  • hugfree - kB of hugepages memory not yet allocated

In [14]:
sa.plot_simple_aggregate0(['hugused', 'hugfree'])


  • hugused-percent - percentage of hugepages memory in use

In [15]:
sa.plot_simple_aggregate0(['hugused-percent'])


IO and transfer rate statistics

  • tps - transfers (IO requests) per second issued to physical devices

In [16]:
sa.plot_simple_aggregate0(['tps'])


  • rtps - read requests per second issued to physical devices
  • wtps - write requests per second issued to physical devices

In [17]:
sa.plot_simple_aggregate0(['rtps', 'wtps'])


  • bread - blocks read from devices per second
  • bwrtn - blocks written to devices per second

In [18]:
sa.plot_simple_aggregate0(['bread', 'bwrtn'])
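
tps counts requests while bread/bwrtn count blocks, so they sit on very different scales. pandas' secondary_y option keeps them readable on one chart (a sketch):

    sa.aggregate0.set_index('timestamp').plot(
        y=['tps', 'bread', 'bwrtn'],
        secondary_y=['bread', 'bwrtn'],
        style='-', figsize=(15, 10))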


Paging statistics

  • pgpgin - kB paged in from disk per second
  • pgpgout - kB paged out to disk per second
  • fault - page faults (major + minor) per second
  • majflt - major page faults per second
  • pgfree - pages placed on the free list per second
  • pgscank - pages scanned by kswapd per second
  • pgscand - pages scanned directly per second
  • pgsteal - pages reclaimed from cache per second

In [19]:
sa.plot_simple_aggregate0(['pgpgin', 'pgpgout', 'fault', 'majflt', 'pgfree', 'pgscank', 'pgscand', 'pgsteal'])
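
The paging counters have very different magnitudes, so a single shared y-axis can hide the smaller ones. A sketch using pandas subplots, one panel per metric:

    paging = ['pgpgin', 'pgpgout', 'fault', 'majflt',
              'pgfree', 'pgscank', 'pgscand', 'pgsteal']
    sa.aggregate0.set_index('timestamp')[paging].plot(
        subplots=True, figsize=(15, 20))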


  • vmeff-percent - page reclaim efficiency: pgsteal as a percentage of pages scanned

In [20]:
sa.plot_simple_aggregate0(['vmeff-percent'])


Kernel table statistics

  • dentunusd - unused entries in directory cache

In [21]:
sa.plot_simple_aggregate0(['dentunusd'])


  • file-nr - file handles in use

In [22]:
sa.plot_simple_aggregate0(['file-nr'])


  • inode-nr - inode handles in use

In [23]:
sa.plot_simple_aggregate0(['inode-nr'])


  • pty-nr - pseudo-terminals in use

In [24]:
sa.plot_simple_aggregate0(['pty-nr'])