First, we extract the data to a csv (alternatively, we could use a database connection here and extract the data directly from the database).
Copy (
select ce.icustay_id, charttime, itemid, valuenum, adm.hospital_expire_flag
from mimiciii.chartevents ce
inner join mimiciii.icustays ie
on ce.icustay_id = ie.icustay_id
inner join mimiciii.admissions adm
on ce.hadm_id = adm.hadm_id
inner join mimiciii.patients pat
on ce.subject_id = pat.subject_id
where ce.charttime between ie.intime and (ie.intime + interval '1 day')
and extract(EPOCH from (ie.intime - pat.dob)) >= (60*60*24*12*15) -- older than 15, i.e. an adult
and itemid in
(
618, -- Respiratory Rate
220210, -- Respiratory Rate
211, -- Heart Rate
220045 -- Heart Rate
)
order by icustay_id, charttime
) To '/data/mimic3/mimic-hr-rr.csv' With CSV HEADER;
We now assume that the data is in a csv file called mimic-hr-rr.csv
in the local directory.
In [1]:
% load the data
fp = fopen('mimic-hr-rr.csv');
header = fgetl(fp);
% convert header from a string to a cell array of strings
header = regexp(header,',','split');
frmt = '%f%s%f%f%f';
data = textscan(fp,frmt,'delimiter',',');
fclose(fp);
% convert the date string into a MATLAB's numeric format
data{2} = datenum(data{2},'yyyy-mm-dd HH:MM:SS');
% now we can convert data from a cell array to a matrix
data = [data{:}];
% here's a preview of the data ('\t' is a tab)
fprintf('%12s\t',header{:});
fprintf('\n')
frmt = '%12g\t%12.2f\t%12g\t%12g\t%1d';
for n=1:5
fprintf(frmt,data(n,:));
fprintf('\n');
end
Out[1]:
In the above, you can see:
ICUSTAY_ID
- This is the unique integer which identifies an ICU stay.CHARTTIME
- This is the time at which a measurement is recorded. It represents the number of days since January 0, 0000.ITEMID
- This is a unique integer which represents the type of data recorded. 220210 is respiratory rate, and 220045 is heart rate.VALUENUM
- This is the actual value of the measurement. So we can see that ICUSTAY_ID
200001 had a respiratory rate of 22 breaths per minute (we have not included the unit of measurement here, but it is in the database if you are interested in confirming this).HOSPITAL_EXPIRE_FLAG
- This indicates whether the patient died in the hospital (1 is death at hospital discharge).We can plot the first patient's data as follows:
In [ ]:
id = 200001; % which icustay_id we'd like to plot
idxID = data(:,1) == id; % only plot data for 1 patient
idxHR = data(:,3) == 211 | data(:,3) == 220045;
idxRR = data(:,3) == 618 | data(:,3) == 220210;
figure(1); hold all;
plot(data(idxID & idxHR,2),data(idxID & idxHR,4),'-',...
'Linewidth',2,'Color',[0.8906, 0.1016, 0.1094]);
plot(data(idxID & idxRR,2),data(idxID & idxRR,4),'-',...
'Linewidth',2,'Color',[0.2148, 0.4922, 0.7188]);
Above we can see the heart rate in red and the respiratory rate in blue. The bottom axis is the days since January 0, 0000 - a bit hard to interpret but we can see that the data spans 1 day.
In [ ]:
% We can use sorting to get the maximum and minimum value
% This is quite complicated syntax - we need to perform vectorized operations
% Note: this type of task is *much* easier in SQL!
[id_unique, idxID] = unique(data(:,1)); % get a list of all unique ICUSTAY_IDs
X = nan(size(id_unique,1),4);
idxHR = data(:,3) == 211 | data(:,3) == 220045;
idxRR = data(:,3) == 618 | data(:,3) == 220210;
tic; % we time how long this process takes
data_tmp = data(idxHR,:);
data_tmp = sortrows(data_tmp, [1,4]); % minimum HR is the first row for each ICUSTAY_ID
[id_tmp,idxA] = unique(data_tmp(:,1));
[idxExist,idxMap] = ismember(id_unique, id_tmp);
X(idxMap(idxExist),1) = data_tmp(idxA,4);
% Repeat for the *maximum* heart rate
data_tmp = sortrows(data_tmp, [1,-4]); % maximum HR is now the first row for each ICUSTAY_ID
X(idxMap(idxExist),2) = data_tmp(idxA,4);
% Repeat for respiratory rate
data_tmp = data(idxRR,:);
data_tmp = sortrows(data_tmp, [1,4]); % minimum RR is the first row for each ICUSTAY_ID
[id_tmp,idxA] = unique(data_tmp(:,1));
[idxExist,idxMap] = ismember(id_unique, id_tmp);
X(idxMap(idxExist),3) = data_tmp(idxA,4);
% Repeat for the *maximum* heart rate
data_tmp = sortrows(data_tmp, [1,-4]); % maximum RR is now the first row for each ICUSTAY_ID
X(idxMap(idxExist),4) = data_tmp(idxA,4);
toc;
% Clear variables so we don't accidentally use the wrong data in temp variables later on
clear data_tmp idxRR idxHR id_tmp idxA;
% Preview of the data:
X(1:5,:)
In [ ]:
% This is equivalent to the above cell, but using for loops
% It takes ~5-10 minutes to run
[id_unique,idxID] = unique(data(:,1)); % get a list of all unique ICUSTAY_IDs
X_slow = nan(size(id_unique,1),4);
idxHR = data(:,3) == 211 | data(:,3) == 220045;
idxRR = data(:,3) == 618 | data(:,3) == 220210;
tic; % we time how long this takes
for n=1:size(id_unique,1)
idxCurrentID = data(:,1) == id_unique(n);
idx = idxCurrentID & idxHR;
if any(idx)
X_slow(n,1) = min(data(idx,4));
X_slow(n,2) = max(data(idx,4));
end
idx = idxCurrentID & idxRR;
if any(idx)
X_slow(n,3) = min(data(idx,4));
X_slow(n,4) = max(data(idx,4));
end
end
toc;
% Clear variables so we don't accidentally use the wrong data in temp variables later on
clear idxRR idxHR idxCurrentID idx;
% let's show a preview of X:
X(1:5,:)
In [ ]:
y = data(idxID,5); % get the outcome for each patient
% plot the variables against each other, coloured by their outcome
figure(1); clf; hold all;
plot(X(y==1,1), X(y==1,2),'x',...
'Linewidth',2,'Color',[0.8906, 0.1016, 0.1094]);
plot(X(y==0,1), X(y==0,2),'o',...
'Linewidth',2,'Color',[0.2148, 0.4922, 0.7188]);
xlabel('Lowest heart rate');
ylabel('Highest heart rate');
Above we can see most people have heart rates between 0-200, except a few with heart rates around 1000 and one with a highest heart rate of 5500. Clearly these are not physiological - you'll find these "outliers" frequently in medical data - it's a consequence of the secondary nature of our analysis. It's obvious to any care provider that these are not possible, so they are ignored during routine care, and not sanitized in the database. We have to fix them ourselves! For now, we can ignore these and set the limits on our plot. Later, we will preprocess these data appropriately.
In [ ]:
y = data(idxID,5); % get the outcome for each patient
% plot the variables against each other, coloured by their outcome
figure(1); clf; hold all;
plot(X(y==1,1), X(y==1,2),'x',...
'Linewidth',2,'Color',[0.8906, 0.1016, 0.1094]);
plot(X(y==0,1), X(y==0,2),'o',...
'Linewidth',2,'Color',[0.2148, 0.4922, 0.7188]);
xlabel('Lowest heart rate');
ylabel('Highest heart rate');
% change the axis to reasonable limits
set(gca,'XLim',[0,240],'YLim',[0,240]);
Now we have a wonderful blob of data. This is because there are so many data points! What a wonderful problem to have. We limit the plot to 200 data points (100 in each class, survived or died in hospital) - this will give us a better visualization. We pick these data points randomly.
In [ ]:
y = data(idxID,5); % get the outcome for each patient
N_DATA_POINTS = 100; % Number of data points to plot for each class - must be less than 6342, the number of deaths
rng(777,'twister'); % fix the random number seed so everyone's plots look identical
idx0 = find(y==0);
[~,idxRand] = sort(rand(size(idx0,1),1),1);
idx0 = idx0(idxRand(1:N_DATA_POINTS));
idx1 = find(y==1);
[~,idxRand] = sort(rand(size(idx1,1),1),1);
idx1 = idx1(idxRand(1:N_DATA_POINTS));
% plot the variables against each other, coloured by their outcome
figure(1); clf; hold all;
plot(X(idx1,1), X(idx1,2),'x',...
'Linewidth',2,'Color',[0.8906, 0.1016, 0.1094]);
plot(X(idx0,1), X(idx0,2),'o',...
'Linewidth',2,'Color',[0.2148, 0.4922, 0.7188]);
xlabel('Lowest heart rate');
ylabel('Highest heart rate');
% change the axis to reasonable limits
set(gca,'XLim',[0,240],'YLim',[0,240]);