Extract data

First, we extract the data to a csv (alternatively, we could use a database connection here and extract the data directly from the database).

Copy (
  select ce.icustay_id, charttime, itemid, valuenum, adm.hospital_expire_flag
  from mimiciii.chartevents ce
  inner join mimiciii.icustays ie
    on ce.icustay_id = ie.icustay_id

  inner join mimiciii.admissions adm
    on ce.hadm_id = adm.hadm_id
  inner join mimiciii.patients pat
    on ce.subject_id = pat.subject_id

  where ce.charttime between ie.intime and (ie.intime + interval '1 day')
  and extract(EPOCH from (ie.intime - pat.dob)) >= (60*60*24*12*15) -- older than 15, i.e. an adult
  and itemid in
  (
  618, --   Respiratory Rate
  220210, --    Respiratory Rate

  211, --   Heart Rate
  220045 -- Heart Rate
  )
  order by icustay_id, charttime
) To '/data/mimic3/mimic-hr-rr.csv' With CSV HEADER;

We now assume that the data is in a csv file called mimic-hr-rr.csv in the local directory.


In [1]:
% load the data

fp = fopen('mimic-hr-rr.csv');
header = fgetl(fp);

% convert header from a string to a cell array of strings
header = regexp(header,',','split');

frmt = '%f%s%f%f%f';
data = textscan(fp,frmt,'delimiter',',');
fclose(fp);

% convert the date string into a MATLAB's numeric format
data{2} = datenum(data{2},'yyyy-mm-dd HH:MM:SS');

% now we can convert data from a cell array to a matrix
data = [data{:}];

% here's a preview of the data ('\t' is a tab)
fprintf('%12s\t',header{:});
fprintf('\n')

frmt = '%12g\t%12.2f\t%12g\t%12g\t%1d';
for n=1:5
    fprintf(frmt,data(n,:));
    fprintf('\n');
end


Out[1]:
icustay_id	   charttime	      itemid	    valuenum	hospital_expire_flag	
      200001	   796924.80	      220210	          22	0
      200001	   796924.80	      220045	         114	0
      200001	   796924.80	      220210	          26	0
      200001	   796924.83	      220045	         113	0
      200001	   796924.83	      220210	          20	0

In the above, you can see:

  • ICUSTAY_ID - This is the unique integer which identifies an ICU stay.
  • CHARTTIME - This is the time at which a measurement is recorded. It represents the number of days since January 0, 0000.
  • ITEMID - This is a unique integer which represents the type of data recorded. 220210 is respiratory rate, and 220045 is heart rate.
  • VALUENUM - This is the actual value of the measurement. So we can see that ICUSTAY_ID 200001 had a respiratory rate of 22 breaths per minute (we have not included the unit of measurement here, but it is in the database if you are interested in confirming this).
  • HOSPITAL_EXPIRE_FLAG - This indicates whether the patient died in the hospital (1 is death at hospital discharge).

We can plot the first patient's data as follows:


In [ ]:
id = 200001; % which icustay_id we'd like to plot

idxID = data(:,1) == id; % only plot data for 1 patient
idxHR = data(:,3) == 211 | data(:,3) == 220045;
idxRR = data(:,3) == 618 | data(:,3) == 220210;

figure(1); hold all;
plot(data(idxID & idxHR,2),data(idxID & idxHR,4),'-',...
    'Linewidth',2,'Color',[0.8906, 0.1016, 0.1094]);
plot(data(idxID & idxRR,2),data(idxID & idxRR,4),'-',...
    'Linewidth',2,'Color',[0.2148, 0.4922, 0.7188]);

Above we can see the heart rate in red and the respiratory rate in blue. The bottom axis is the days since January 0, 0000 - a bit hard to interpret but we can see that the data spans 1 day.

Extracting data

Now we have plotted the data for a few patients and have an idea of what it looks like. We'd like to extract some data which is useable in our machine learning classifiers. That means we need to convert this time-series into a design matrix.


In [ ]:
% We can use sorting to get the maximum and minimum value
% This is quite complicated syntax - we need to perform vectorized operations
% Note: this type of task is *much* easier in SQL!

[id_unique, idxID] = unique(data(:,1)); % get a list of all unique ICUSTAY_IDs
X = nan(size(id_unique,1),4);

idxHR = data(:,3) == 211 | data(:,3) == 220045;
idxRR = data(:,3) == 618 | data(:,3) == 220210;

tic; % we time how long this process takes

data_tmp = data(idxHR,:);
data_tmp = sortrows(data_tmp, [1,4]); % minimum HR is the first row for each ICUSTAY_ID

[id_tmp,idxA] = unique(data_tmp(:,1));
[idxExist,idxMap] = ismember(id_unique, id_tmp);
X(idxMap(idxExist),1) = data_tmp(idxA,4);

% Repeat for the *maximum* heart rate
data_tmp = sortrows(data_tmp, [1,-4]); % maximum HR is now the first row for each ICUSTAY_ID
X(idxMap(idxExist),2) = data_tmp(idxA,4);


% Repeat for respiratory rate
data_tmp = data(idxRR,:);
data_tmp = sortrows(data_tmp, [1,4]); % minimum RR is the first row for each ICUSTAY_ID

[id_tmp,idxA] = unique(data_tmp(:,1));
[idxExist,idxMap] = ismember(id_unique, id_tmp);
X(idxMap(idxExist),3) = data_tmp(idxA,4);

% Repeat for the *maximum* heart rate
data_tmp = sortrows(data_tmp, [1,-4]); % maximum RR is now the first row for each ICUSTAY_ID
X(idxMap(idxExist),4) = data_tmp(idxA,4);

toc;

% Clear variables so we don't accidentally use the wrong data in temp variables later on
clear data_tmp idxRR idxHR id_tmp idxA;

% Preview of the data:
X(1:5,:)

In [ ]:
% This is equivalent to the above cell, but using for loops
% It takes ~5-10 minutes to run

[id_unique,idxID] = unique(data(:,1)); % get a list of all unique ICUSTAY_IDs
X_slow = nan(size(id_unique,1),4);

idxHR = data(:,3) == 211 | data(:,3) == 220045;
idxRR = data(:,3) == 618 | data(:,3) == 220210;

tic; % we time how long this takes

for n=1:size(id_unique,1)
    idxCurrentID = data(:,1) == id_unique(n);
    
    idx = idxCurrentID & idxHR;
    if any(idx)
        X_slow(n,1) = min(data(idx,4));
        X_slow(n,2) = max(data(idx,4));
    end
    
    idx = idxCurrentID & idxRR;
    if any(idx)
        X_slow(n,3) = min(data(idx,4));
        X_slow(n,4) = max(data(idx,4));
    end
end

toc;

% Clear variables so we don't accidentally use the wrong data in temp variables later on
clear idxRR idxHR idxCurrentID idx;

% let's show a preview of X:
X(1:5,:)

In [ ]:
y = data(idxID,5); % get the outcome for each patient

% plot the variables against each other, coloured by their outcome
figure(1); clf; hold all;
plot(X(y==1,1), X(y==1,2),'x',...
    'Linewidth',2,'Color',[0.8906, 0.1016, 0.1094]);
plot(X(y==0,1), X(y==0,2),'o',...
    'Linewidth',2,'Color',[0.2148, 0.4922, 0.7188]);

xlabel('Lowest heart rate');
ylabel('Highest heart rate');

Above we can see most people have heart rates between 0-200, except a few with heart rates around 1000 and one with a highest heart rate of 5500. Clearly these are not physiological - you'll find these "outliers" frequently in medical data - it's a consequence of the secondary nature of our analysis. It's obvious to any care provider that these are not possible, so they are ignored during routine care, and not sanitized in the database. We have to fix them ourselves! For now, we can ignore these and set the limits on our plot. Later, we will preprocess these data appropriately.


In [ ]:
y = data(idxID,5); % get the outcome for each patient

% plot the variables against each other, coloured by their outcome
figure(1); clf; hold all;
plot(X(y==1,1), X(y==1,2),'x',...
    'Linewidth',2,'Color',[0.8906, 0.1016, 0.1094]);
plot(X(y==0,1), X(y==0,2),'o',...
    'Linewidth',2,'Color',[0.2148, 0.4922, 0.7188]);

xlabel('Lowest heart rate');
ylabel('Highest heart rate');

% change the axis to reasonable limits
set(gca,'XLim',[0,240],'YLim',[0,240]);

Now we have a wonderful blob of data. This is because there are so many data points! What a wonderful problem to have. We limit the plot to 200 data points (100 in each class, survived or died in hospital) - this will give us a better visualization. We pick these data points randomly.


In [ ]:
y = data(idxID,5); % get the outcome for each patient

N_DATA_POINTS = 100; % Number of data points to plot for each class - must be less than 6342, the number of deaths

rng(777,'twister'); % fix the random number seed so everyone's plots look identical

idx0 = find(y==0);
[~,idxRand] = sort(rand(size(idx0,1),1),1);
idx0 = idx0(idxRand(1:N_DATA_POINTS));

idx1 = find(y==1);
[~,idxRand] = sort(rand(size(idx1,1),1),1);
idx1 = idx1(idxRand(1:N_DATA_POINTS));

% plot the variables against each other, coloured by their outcome
figure(1); clf; hold all;
plot(X(idx1,1), X(idx1,2),'x',...
    'Linewidth',2,'Color',[0.8906, 0.1016, 0.1094]);
plot(X(idx0,1), X(idx0,2),'o',...
    'Linewidth',2,'Color',[0.2148, 0.4922, 0.7188]);

xlabel('Lowest heart rate');
ylabel('Highest heart rate');

% change the axis to reasonable limits
set(gca,'XLim',[0,240],'YLim',[0,240]);