This notebook demonstrates a micro-simulation of patients visiting clinical staff. We were motivated to develop this micro-simulation in response to a challenge during privacy and anonymity software course to provide an example of a Graph that would be expected to occur in the health sciences and be analogous to the Graphs that occur social media. Specifically, we where challenged to provide an example of self-referential or recursive, dataset of relationships between pairs of people. To this end, when clinical staff are allowed to be patients as well, and patients are allowed to have multiple visits over a time span, the record of patient visits with clinical staff is a self-referential data set. This dataset turns out to be a Multi-Graph with the Edge sets parameterized by time, under the constraint that at any moment in time a patient vertex is unique in the set of all edges that exist at that moment. This constraint amounts to requiring that every patient is involved in only one visit at a time.
In developing the micro-simulation we also fortuitously generated a dataset that is at higher risk of re-identification then would be naively calculated based solely on the uniform probability assumptions used in the standard risk calculations. This arose from an effort to make the micro-simulation more realistic by organizing patients and clinical staff into clinics, representing the exclusive employment of the clinical staff at the clinic and the exclusive registration of the patient at the clinic. We further stipulated that patients can only visit the clinical staff of the clinic to which the patient is registered, and that every clinical staff must be registered as a patient at a clinic that is different from the one at which they are employed. In effect for risk calculations, this reduces the population size to that of the average size of a single clinic, because each clinic is a distinct and independent population of patients.
Consider a fixed population, where any member can be a patient at any time. The population forms the Vertexes of the Graph.
$$ \begin{align} V & = \left\lbrace \text{person} : \text{person} \in \text{population} \right\rbrace \end{align} $$A subset of this population are also clinical staff.
$$ \begin{align} V_\text{staff} & = \left\lbrace \text{person} : \text{person} \in V \text{, person is clinical staff} \right\rbrace\\ \end{align} $$The patients visiting clinical staff on each day forms a set of edges parameterized by time.
$$ \begin{align} E_\text{day} & = \left\lbrace \left( \text{day}, \text{patient}, \text{staff} \right) : \text{patient} \in V \text{, staff} \in V_\text{staff} \text{, patient visits staff on day} \right\rbrace \end{align} $$The number of possible realizations of clinical visit event datasets, regardless of the assignments of clinics, or entailing sampling probabilities, is fixed.
$$ \begin{align} N & = \left( \sum_{d = 0}^{\#\text{days}} \binom{\#\text{days}}{d} \left( \#V_\text{staff} \right)^d \right)^{\#V - \#V_\text{staff}} \times \left( \sum_{d = 0}^{\#\text{days}} \binom{\#\text{days}}{d} \left( \#V_\text{staff} - 1\right)^d \right)^{\#V_\text{staff}}\\ & = \left( \#V_\text{staff} + 1 \right)^{\#\text{days} \cdot \left( \#V - \#V_\text{staff} \right)} \times \left( \#V_\text{staff} \right)^{\#\text{days} \cdot \#V_\text{staff}} \end{align} $$The left hand term in the product is the number of combinations of visits among patients who are not clinical staff. Conversely, the right hand term in the product is the number of combinations of visits among patients who are clinical staff
Even the most rudimentary assumption that there is an average number of expected visits in a time frame,
$$ \begin{align} \mathbb{P}\left[\text{visit} \middle\| \text{patient} \right] & = \frac{\text{expected visits}}{\#\text{days}} \end{align} $$will cause the dataset of all clinical visit events to be non-uniformly sampled from the all possible realizations of datasets.
$$ \begin{align} \mathbb{P}\left[E\right] & = \left( \prod_{v \notin V_\text{staff}} \left( \frac{1}{\#V_\text{staff}} \right)^{d_v} \left( \frac{\text{expected visits}}{\#\text{days}} \right)^{d_v} \left( 1 - \frac{\text{expected visits}}{\#\text{days}} \right)^{\#\text{days} - d_v} \right)\\ & \quad \times \left( \prod_{v \in V_\text{staff}} \left( \frac{1}{\#V_\text{staff} - 1} \right)^{d_v} \left( \frac{\text{expected visits}}{\#\text{days}} \right)^{d_v} \left( 1 - \frac{\text{expected visits}}{\#\text{days}} \right)^{\#\text{days} - d_v} \right) \end{align} $$In the product $d_v$ is the number of events of the person represented by edge $v$.
This construction naively assumes that any patient can visit any clinician on any given day, with uniform probability, provided no patient has two visits, Edges, on the same day.
$$ \begin{align} \mathbb{P} \left[ \text{staff} \middle\| \text{day, patient} \notin V_\text{staff} \right] & = \frac{1}{\#V_\text{staff}}\\ \mathbb{P} \left[ \text{staff} \middle\| \text{day, patient} \in V_\text{staff} \right] & = \frac{1}{\#V_\text{staff} - 1} \end{align} $$Without altering the assumption of an average number of expected visits in a time frame, we can alter the probability of a patient visiting a particular clinical staff, from the uniform probability. We accomplish this by organizing patients and clinical staff into clinics, and staff into shifts.
The clinics and shift schedules are assigned randomly following a set of rules.
These rules constrain both the number of possible clinical staff a patient can visit, and the number of clinical staff seeing patients on any day, thus the decrease the denominator in the probability.
Aside from the total population, the number of clinic staff, the number of clinics, and the date range, the model has two parameters that can be set from the empirical statistics:
For each clinic the model calculates the number of staff to schedule on every day, using the number of patients registered at the clinic, the number of days simulated, the expected number of visits, and the expected workload.
$$ \begin{align} \text{daily scheduled staff} & = \frac{\text{number of patients} \times \text{expected visits per patient}}{\text{days simulated} \times \text{expected daily client load per staff}} \end{align} $$This number is truncated to stay between at least one and the total number of clinical staff employed at the clinic.
The engine of the simulation has been wrapped in a module called MicroSimulation
, which can be found on GitHub here.
In [1]:
include("MicroSimulation.jl")
Before running the stochastic event simulation a population must be created, randomly assigning clinics to patients and staff. We do this using the Registration
constructor.
In [2]:
?MicroSimulation.Registrations
Out[2]:
We will bootstrap the simulation from a rough aggregation of the fiscal 2016/17 visits statistics.
Statistic | Value |
---|---|
Population | $4067175$ |
Patients | $3578579$ |
Patients Prescribed | $2569015$ |
Providers | $12486$ |
Providers Prescribing | $10675$ |
Clinics | $2687$ |
Patient Days | $24438296$ |
Patient Prescribed Days | $8030584$ |
Provider Days | $1423060$ |
Visits | $26360227$ |
Visits Prescribed | $8140983$ |
The theoretical single day probability of revisit conditioned on the ordinal of the previous visit is approximately quadratic.
$$ \begin{align} \mathbb{P} \left[ v \middle\| v - 1 \right] & = 23\% \times \frac{ v \left( 365 - v \right) }{182.5^2} \end{align} $$The simulation will consist of $4100000$ patients, of which $13000$ are also clinical staff, distributed across $2700$ clinics.
In [3]:
rs = MicroSimulation.Registrations(4100000, 13000, 2700)
Out[3]:
Given the randomly generated assignments of clinics to patients and staff we can now stochastically generate visit events by sequentially sampling the conditional probabilities. This is done by calling the Events constructor.
In [4]:
?MicroSimulation.Events
Out[4]:
We simulate a single year's events, where each person has $6$ visits on average, and each clinical staff can attend to $19$ patients in a single day on average.
In [6]:
evs = MicroSimulation.Events(Date(2016, 4, 1):Date(2017, 3, 31), rs, 6, 19)
Out[6]:
Write the Registrations
and the Events
to *.csv
files.
In [14]:
f = open("registrations.csv", "w")
writecsv(f, ["personidentifier" "patientclinic" "staffclinic"])
writecsv(f, rs)
close(f)
In [16]:
f = open("events.csv", "w")
writecsv(f, ["personidentifier" "patientidentifier" "staffidentifier"])
writecsv(f, evs)
close(f)
The observed empirical estimation of the conditional probability of revisit is not an artifact of the sampling process.
In [21]:
function uniformvisitsimulate(pop::Int64,days::Int64,visits::Int64)
prob = visits/days
ps = Vector{Int64}()
ds = Vector{Int64}()
cs = Vector{Int64}()
bs = Vector{Int64}()
as = Vector{Int64}()
for p = 1:pop
c = 0
push!(ps, p)
push!(ds, 0)
push!(cs, c)
push!(bs, 0)
push!(as, 365)
for d = 1:days
if prob > rand()
c += 1
b = d - ds[end]
as[end] = b
push!(ps, p)
push!(ds, d)
push!(cs, c)
push!(bs, b)
push!(as, 365 - d)
end
end
end
return [ps cs ds bs as]
end
Out[21]:
In [22]:
a = uniformvisitsimulate(4000000,365,6)
Out[22]:
In [23]:
f = open("uniformvisits.csv", "w")
writecsv(f, ["personidentifier" "visitorder" "visitday" "daysbefore" "daysafter"])
writecsv(f, a)
close(f)
In [1]:
using PlotlyJS
In [66]:
ps = (1 .- exp.(-0.03125 .* collect(1:365)));
ts = diagm(1 .- [ps; 0]) .+ diagm(ps, 1);
sum((ts^365)[1, :])
Out[66]:
In [67]:
plot((ts^365)[1, :])
Out[67]:
In [70]:
det([0.0 0.5 0.5;0.5 0.0 0.5; 0.5 0.5 0.0])
Out[70]: