Using hillmaker

In this notebook we'll focus on basic use of Hillmaker for analyzing occupancy in a typical hospital setting. The data is fictitious data from a hospital short stay unit. Patients flow through a short stay unit for a variety of procedures, tests or therapies. Let's assume patients can be classified into one of five categories of patient types: ART (arterialgram), CAT (post cardiac-cath), MYE (myelogram), IVT (IV therapy), and OTH (other). From one of our hospital information systems we were able to get raw data about the entry and exit times of each patient. For simplicity, the data is in a csv file.

This example assumes you are already familiar with statistical occupancy analysis using the old version of Hillmaker or some similar such tool. It also assumes some knowledge of using Python for analytical work.

The following blog posts are helpful if you are not familiar with occupancy analysis:

Computing occupancy statistics with Python - Part 1 of 3

Computing occupancy statistics with Python - Part 2 of 3

Current status of code

The new hillmaker is implemented as a Python module which can be used by importing hillmaker and then calling the main hillmaker function, make_hills() (or any component function included in the module). This new version of hillmaker is in what I'd call an alpha state. The output does match the Access version for the ShortStay database that I included in the original Hillmaker. Use at your own risk.

It is licensed under an Apache 2.0 license. It is a widely used permissive free software license. See https://en.wikipedia.org/wiki/Apache_License for additional information.

Installing Hillmaker

Whereas the old Hillmaker required MS Access, the new one requires an installation of Python 3 along with several Python modules that are widely used for analytics and data science work.

Getting Python and a jillion analytical packages via Anaconda

An very easy way to get Python 3 pre-configured with tons of analytical Python packages is to use the Anaconda distro for Python. From their Downloads page:

Anaconda is a completely free Python distribution (including for commercial use and redistribution). It includes more than 300 of the most popular Python packages for science, math, engineering, and data analysis. See the packages included with Anaconda and the Anaconda changelog.

Make sure you download Python 3.x (3.5 is latest version as of January, 2016)

There are several really nice reasons to use the Anaconda Python distro for data science work:

  • it comes preconfigured with hundreds of the most popular data science Python packages installed and they just work
  • large community of Anaconda data science users and vibrant user community on places like StackOverflow
  • it has a companion package manager called Conda which makes it easy to install new packages as well as to create and manage virtual environments

Getting Hillmaker

As of January 22, 2016,hillmaker is publicly available from the Python Package Index known as PyPi as well as Anaconda Cloud. They are similar to CRAN for R. Source code is also be available from my GitHub site https://github.com/misken/hillmaker and it is an open-source project. If you work with Python, you should know a little bit about Python package installation. There is already a companion project on GitHub called hillmaker-examples which contains, well, examples of hillmaker use cases.

Installing Hillmaker

You can use either pip or conda to install hillmaker. I suggest learning about Python virtual environments and either using pyenv, virtualenv or conda (preferred) to create a Python virtual environment and then install hillmaker into it. This way you avoid mixing developmental third-party packages like hillmaker with your base Anaconda Python environment.

Step 1 - Open a terminal and install using Conda or Pip

To install using conda:

conda install -c https://conda.anaconda.org/hselab hillmaker

OR

To install using pip:

pip install hillmaker

Step 2 - Confirm that hillmaker was installed

Use the conda list command to see all the installed packages in your Anaconda3 root.

conda list

Now fire up a Python session and try to import hillmaker.

c:\Users\jerry\Documents\hillmaker>python
Python 3.5.1 |Continuum Analytics, Inc.| (default, Dec  7 2015, 11:16:01) 
[MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import hillmaker
>>> help(hillmaker.make_hills)
Help on function make_hills in module hillmaker.hills:

make_hills(scenario_name, stops_df, infield, outfield, start_analysis, end_analysis, catfield='', total_str='Total', bin_size_minutes=60, cat_to_exclude=None, totals=True, export_csv=True, export_path='.', return_dataframes=False, verbose=0)
    Compute occupancy, arrival, and departure statistics by time bin of day and day of week.

    Main function that first calls `bydatetime.make_bydatetime` to calculate occupancy, arrival
    and departure values by date by time bin and then calls `summarize.summarize_bydatetime`
    to compute the summary statistics.

    Parameters
    ----------
    scenario_name : string
        Used in output filenames
    stops_df : DataFrame
        Base data containing one row per visit
    infield : string
        Column name corresponding to the arrival times

... a bunch more stuff ...

If the install went well, you shouldn't get any errors when you import hillmaker and the help() command should show you the docstring for the make_hills() function.

Using hillmaker

The rest of this IPython notebook will illustrate a few ways to use the hillmaker package.

Module imports

To run Hillmaker we only need to import a few modules. Since the main Hillmaker function uses Pandas DataFrames for both data input and output, we need to import pandas in addition to hillmaker.


In [1]:
import pandas as pd
import hillmaker as hm

Read main stop data file

Here's the first few lines from our csv file containing the patient stop data:

PatID,InRoomTS,OutRoomTS,PatType
1,1/1/1996 7:44,1/1/1996 8:50,IVT
2,1/1/1996 8:28,1/1/1996 9:20,IVT
3,1/1/1996 11:44,1/1/1996 13:30,MYE
4,1/1/1996 11:51,1/1/1996 12:55,CAT
5,1/1/1996 12:10,1/1/1996 13:00,IVT
6,1/1/1996 14:16,1/1/1996 15:35,IVT
7,1/1/1996 14:40,1/1/1996 15:25,IVT


Read the short stay data from a csv file into a DataFrame and tell Pandas which fields to treat as dates.


In [2]:
file_stopdata = '../data/ShortStay.csv'
stops_df = pd.read_csv(file_stopdata, parse_dates=['InRoomTS','OutRoomTS'])
stops_df.info() # Check out the structure of the resulting DataFrame


<class 'pandas.core.frame.DataFrame'>
Int64Index: 59877 entries, 0 to 59876
Data columns (total 4 columns):
PatID        59877 non-null int64
InRoomTS     59877 non-null datetime64[ns]
OutRoomTS    59877 non-null datetime64[ns]
PatType      59877 non-null object
dtypes: datetime64[ns](2), int64(1), object(1)
memory usage: 2.3+ MB

Check out the top and bottom of stops_df.


In [3]:
stops_df.head(7)


Out[3]:
PatID InRoomTS OutRoomTS PatType
0 1 1996-01-01 07:44:00 1996-01-01 08:50:00 IVT
1 2 1996-01-01 08:28:00 1996-01-01 09:20:00 IVT
2 3 1996-01-01 11:44:00 1996-01-01 13:30:00 MYE
3 4 1996-01-01 11:51:00 1996-01-01 12:55:00 CAT
4 5 1996-01-01 12:10:00 1996-01-01 13:00:00 IVT
5 6 1996-01-01 14:16:00 1996-01-01 15:35:00 IVT
6 7 1996-01-01 14:40:00 1996-01-01 15:25:00 IVT

In [4]:
stops_df.tail(5)


Out[4]:
PatID InRoomTS OutRoomTS PatType
59872 59873 1996-09-30 19:31:00 1996-09-30 20:15:00 IVT
59873 59874 1996-09-30 20:23:00 1996-09-30 21:30:00 IVT
59874 59875 1996-09-30 21:00:00 1996-09-30 22:45:00 CAT
59875 59876 1996-09-30 21:57:00 1996-09-30 22:40:00 IVT
59876 59877 1996-09-30 22:45:00 1996-09-30 23:35:00 CAT

No obvious problems. We'll assume the data was all read in correctly.

Creating occupancy summaries

The primary function in Hillmaker is called make_hills and plays the same role as the Hillmaker function in the original Access VBA version of Hillmaker. Let's get a little help on this function.


In [5]:
help(hm.make_hills)


Help on function make_hills in module hillmaker.hills:

make_hills(scenario_name, stops_df, infield, outfield, start_analysis, end_analysis, catfield='', total_str='Total', bin_size_minutes=60, cat_to_exclude=None, totals=True, export_csv=True, export_path='.', return_dataframes=False, verbose=0)
    Compute occupancy, arrival, and departure statistics by time bin of day and day of week.
    
    Main function that first calls `bydatetime.make_bydatetime` to calculate occupancy, arrival
    and departure values by date by time bin and then calls `summarize.summarize_bydatetime`
    to compute the summary statistics.
    
    Parameters
    ----------
    scenario_name : string
        Used in output filenames
    stops_df : DataFrame
        Base data containing one row per visit
    infield : string
        Column name corresponding to the arrival times
    outfield : string
        Column name corresponding to the departure times
    start_analysis : datetime-like, str
        Starting datetime for the analysis (must be convertible to pandas Timestamp)
    end_analysis : datetime-like, str
        Ending datetime for the analysis (must be convertible to pandas Timestamp)
    catfield : string, optional
        Column name corresponding to the category. If none is specified, then only overall occupancy is analyzed.
        Default is ''
    total_str : string, optional
        Column name to use for the overall category, default is 'Total'
    bin_size_minutes : int, optional
        Number of minutes in each time bin of the day, default is 60
    cat_to_exclude : list, optional
        Categories to ignore, default is None
    totals : bool, optional
       If true, overall totals are computed. Else, just category specific values computed. Default is True.
    export_csv : bool, optional
       If true, results DataFrames are exported to csv files. Default is True.
    export_path : string, optional
        Destination path for exported csv files, default is current directory
    return_dataframes : bool, optional
        If true, dictionary of DataFrames is returned. Default is False.
    verbose : int, optional
        The verbosity level. The default, zero, means silent mode. Higher numbers mean more output messages.
    
    Returns
    -------
    dict of DataFrames
       The bydatetime and all summaries.
    
       Only returned if return_dataframes=True
    
       Example:
    
       {'bydatetime': bydt_df,
        'occupancy': occ_stats_summary,
        'arrivals': arr_stats_summary,
        'departures': dep_stats_summary,
        'tot_occ': occ_stats_summary_cat,
        'tot_arr': arr_stats_summary_cat,
        'tot_dep': dep_stats_summary_cat}

Most of the parameters are similar to those in the original VBA version, though a few new ones have been added. For example, the cat_to_exclude parameter allows you to specify a list of category values for which you do not want occupancy statistics computed. Also, since the VBA version used an Access database as the container for its output, new parameters were added to control output to csv files instead.

Example 1: 60 minute bins, all categories, export to csv

Specify values for all the required inputs:


In [16]:
# Required inputs
scenario = 'ss_example_1'
in_fld_name = 'InRoomTS'
out_fld_name = 'OutRoomTS'
cat_fld_name = 'PatType'
start = '1/1/1996'
end = '3/30/1996 23:45'

# Optional inputs
verbose = 1
output = './output'

Now we'll call the main make_hills function. We won't capture the return values but will simply take the default behavior of having the summaries exported to csv files. You'll see that the filenames will contain the scenario value.


In [17]:
hm.make_hills(scenario, stops_df, in_fld_name, out_fld_name, start, end, cat_fld_name, 
              export_path = output, verbose=verbose)


min of intime: 1996-01-01 07:44:00
max of outtime: 1996-09-30 23:35:00
Datetime DataFrame created (seconds): 18.0369
Summaries by datetime created (seconds): 33.9729
Summaries by category created (seconds): 0.2510
Summaries exported to csv (seconds): 0.2879
Total time (seconds): 52.5504

Here's a screenshot of the output folder containing the csv files created by Hillmaker.

If you've used the previous version of Hillmaker, you'll recognize these files. A few more statistics have been added, but otherwise they are the same. These csv files can be imported into a spreadsheet application for plot creation. Of course, we can also make plots in Python.

The files with 'cat' in their name are new. They contain summary overall summary statistics by category. In other words, they are NOT by time of day and day of week.

Example 2: 30 minute bins, only CAT and IVT, return values to DataFrames


In [8]:
# Required inputs - same as Example 1 except for scenario name
scenario = 'ss_example_2'
in_fld_name = 'InRoomTS'
out_fld_name = 'OutRoomTS'
cat_fld_name = 'PatType'
start = '1/1/1996'
end = '3/30/1996 23:45'

# Optional inputs
tot_fld_name = 'CAT_IVT' # Just to make it clear that it's only these patient types
bin_mins = 30 # Half-hour time bins
exclude = ['ART','MYE','OTH'] # Tell Hillmaker to ignore these patient types
outputpath = '.'

Now we'll call make_hills and tuck the results (a dictionary of DataFrames) into a local variable. Then we can explore them a bit with Pandas.


In [9]:
results_ex2 = hm.make_hills(scenario, stops_df, in_fld_name, out_fld_name, start, end, cat_fld_name,
                            total_str=tot_fld_name, bin_size_minutes=bin_mins, export_path=outputpath,
                            cat_to_exclude=exclude, return_dataframes=True)

In [10]:
results_ex2.keys()


Out[10]:
dict_keys(['bydatetime', 'tot_dep', 'occupancy', 'tot_arr', 'tot_occ', 'arrivals', 'departures'])

In [11]:
occ_df = results_ex2['occupancy']

In [12]:
occ_df.head()


Out[12]:
count cv kurt max mean min p50 p55 p60 p65 ... p80 p85 p90 p95 p975 p99 sem skew stdev var
category day_of_week bin_of_day
CAT 0 0 13 1.916684 0.509139 0.633333 0.125641 0 0 0 0.000000 0.000000 ... 0.30 0.500000 0.500000 0.553333 0.593333 0.617333 0.066790 1.522946 0.240814 0.057991
1 13 3.605551 13.000000 0.333333 0.025641 0 0 0 0.000000 0.000000 ... 0.00 0.000000 0.000000 0.133333 0.233333 0.293333 0.025641 3.605551 0.092450 0.008547
2 13 3.605551 13.000000 0.966667 0.074359 0 0 0 0.000000 0.000000 ... 0.00 0.000000 0.000000 0.386667 0.676667 0.850667 0.074359 3.605551 0.268105 0.071880
3 13 3.076846 12.025087 0.966667 0.087179 0 0 0 0.000000 0.000000 ... 0.00 0.033333 0.133333 0.486667 0.726667 0.870667 0.074396 3.436715 0.268238 0.071952
4 13 1.344087 -1.657170 1.000000 0.328205 0 0 0 0.126667 0.506667 ... 0.82 0.866667 0.966667 1.000000 1.000000 1.000000 0.122349 0.672111 0.441136 0.194601

5 rows × 22 columns


In [13]:
occ_df.tail()


Out[13]:
count cv kurt max mean min p50 p55 p60 p65 ... p80 p85 p90 p95 p975 p99 sem skew stdev var
category day_of_week bin_of_day
IVT 6 43 12 1.929904 2.592905 1.000000 0.172222 0 0 0 0 0 ... 0.4 0.523333 0.56 0.761667 0.880833 0.952333 0.095948 1.835588 0.332372 0.110471
44 12 3.464102 12.000000 1.000000 0.083333 0 0 0 0 0 ... 0.0 0.000000 0.00 0.450000 0.725000 0.890000 0.083333 3.464102 0.288675 0.083333
45 12 2.372342 3.667887 0.833333 0.119444 0 0 0 0 0 ... 0.0 0.210000 0.54 0.705000 0.769167 0.807667 0.081800 2.194808 0.283363 0.080295
46 12 2.335497 2.640000 1.000000 0.166667 0 0 0 0 0 ... 0.0 0.350000 0.90 1.000000 1.000000 1.000000 0.112367 2.055237 0.389249 0.151515
47 12 3.464102 12.000000 0.666667 0.055556 0 0 0 0 0 ... 0.0 0.000000 0.00 0.300000 0.483333 0.593333 0.055556 3.464102 0.192450 0.037037

5 rows × 22 columns


In [14]:
occ_df.info()


<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1008 entries, (CAT, 0, 0) to (IVT, 6, 47)
Data columns (total 22 columns):
count    1008 non-null float64
cv       1008 non-null float64
kurt     1008 non-null float64
max      1008 non-null float64
mean     1008 non-null float64
min      1008 non-null float64
p50      1008 non-null float64
p55      1008 non-null float64
p60      1008 non-null float64
p65      1008 non-null float64
p70      1008 non-null float64
p75      1008 non-null float64
p80      1008 non-null float64
p85      1008 non-null float64
p90      1008 non-null float64
p95      1008 non-null float64
p975     1008 non-null float64
p99      1008 non-null float64
sem      1008 non-null float64
skew     1008 non-null float64
stdev    1008 non-null float64
var      1008 non-null float64
dtypes: float64(22)
memory usage: 181.1+ KB

Example 3 - Running via a Python script

Of course, you don't have to run Python statements through an IPython notebook. You can simply create a short Python script and run that directly in a terminal. An example, test_shortstay.py, can be found in the scripts subfolder of the hillmaker-examples project. You can run it from a command prompt like this:

python test_shortstay.py

Here's what it looks like - you can modify as necessary for your needs. There is another example in that folder as well, test_obsim_log.py, that is slightly more complex in that the input data has raw simulation times (i.e. minutes past t=0) and we need to do some datetime math to turn them into calendar based inputs.


In [15]:
import pandas as pd

import hillmaker as hm

file_stopdata = '../data/ShortStay.csv'

# Required inputs
scenario = 'sstest_120'
in_fld_name = 'InRoomTS'
out_fld_name = 'OutRoomTS'
cat_fld_name = 'PatType'
start = '1/1/1996'
end = '3/30/1996 23:45'

# Optional inputs
tot_fld_name = 'SSU'
bin_mins = 120



df = pd.read_csv(file_stopdata, parse_dates=[in_fld_name, out_fld_name])

hm.make_hills(scenario, df, in_fld_name, out_fld_name,
                     start, end, cat_fld_name,
                     tot_fld_name, bin_mins,
                     cat_to_exclude=None,
                     verbose=1)


min of intime: 1996-01-01 07:44:00
max of outtime: 1996-09-30 23:35:00
Datetime DataFrame created (seconds): 16.5991
Summaries by datetime created (seconds): 17.4285
Summaries by category created (seconds): 0.2457
Summaries exported to csv (seconds): 0.1536
Total time (seconds): 34.4285

More elaborate versions of scripts like test_shortstay.py can be envisioned. For example, an entire folder of input data files could be processed by simple enclosing the hm.make_hills call inside a loop over the collection of input files:


In [ ]:
for log_fn in glob.glob('logs/*.csv'):
    
        # Read the log file and filter by included categories
        stops_df = pd.read_csv(log_fn, parse_dates=[in_fld_name, out_fld_name])
        
        hm.make_hills(scenario, df, in_fld_name, out_fld_name, start, end, cat_fld_name)
        ...

User interface plans

Over the years, I (and many others) have used Hillmaker in a variety of ways, including:

  • MS Access form based GUI
  • run main Hillmaker sub from Access VBA Immediate Window
  • run Hillmaker main sub (and/or components subs) via custom VBA procedures

I'd like users to be able to use the new Python based version in a number of different ways as well. As I'll show in this IPython notebook, it can be used by importing the hillmaker module and then calling Hillmaker functions via:

  • an IPython notebook (or any Python terminal such as an IPython shell or QT console, or IDLE)
  • a Python script with the input arguments set and passed via Python statements

While these two options provide tons of flexibility for power users, I also want to create other interfaces that don't require users to write Python code. At a minimum, I plan to create a command line interface (CLI) as well as a GUI that is similar to the old Access version.

A CLI for Hillmaker

Python has several nice tools for creating CLI's. Both docopt and argparse are part of the standard library. Layered on top of these are tools like Click. See http://docs.python-guide.org/en/latest/scenarios/cli/ for more. A well designed CLI will make it easy to use Python from the command line in either Windows or Linux.

A GUI for Hillmaker

This is uncharted territory for me. Python has a number of frameworks/toolkits for creating GUI apps. This is not the highest priority for me but I do plan on creating a GUI for Hillmaker. If anyone wants to help with this, awesome.


In [ ]: