Authors: Prof. med. Thomas Ganslandt Thomas.Ganslandt@medma.uni-heidelberg.de
and Kim Hee HeeEun.Kim@medma.uni-heidelberg.de
Heinrich-Lanz-Center for Digital Health (HLZ) of the Medical Faculty Mannheim
Heidelberg University
This is a part of a tutorial prepared for TMF summer school on 03.07.2019
MIMIC (Medical Information Mart for Intensive Care) is a freely accessible database containing data on Intensive Care Unit (ICU) patients. The demo dataset is limited to 100 patients and is publicly available as CSV files or as a single Postgres database backup file.
Instructions for accessing the MIMIC demo dataset:
- Create an account on PhysioNet using the following link: https://physionet.org/register/
- Navigate to the project page: https://physionet.org/content/mimiciii-demo/
- Read the Data Use Agreement and click “I agree” to access the data
Place the MIMIC-III data files used in this tutorial (ADMISSIONS.csv, PATIENTS.csv, CPTEVENTS.csv) in the data/ subfolder.
Agenda
http://pandas.pydata.org/pandas-docs/stable/reference/
Pandas is a Python library for exploring, processing, and modeling data
DataFrame.plot([x, y], kind)
kind :
- 'line': line plot (default)
- 'bar': vertical bar plot
- 'barh': horizontal bar plot
- 'hist': histogram
- 'box': boxplot
- 'kde': Kernel Density Estimation plot
- 'density': same as 'kde'
- 'area': stacked area plot
- 'pie': pie plot
- 'scatter': scatter plot
- 'hexbin': Hexagonal binning plot
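As a minimal sketch of how the `kind` argument changes the output, the following draws the same counts twice. The category names and numbers are made up for illustration and are not taken from MIMIC.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up counts for illustration only (not MIMIC data)
df = pd.DataFrame({"n": [5, 3, 2]}, index=["A", "B", "C"])

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# Same data, two different values of `kind`
df.plot(kind="bar", ax=ax_left, title="kind='bar'", legend=False)
df.plot(kind="barh", ax=ax_right, title="kind='barh'", legend=False)

fig.tight_layout()
```

Passing `ax=` places each plot on an existing axis, which makes it easy to compare plot kinds side by side.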
In [ ]:
import pandas as pd
pd.set_option('display.max_columns', 999)
import pandas.io.sql as psql
# render figures directly in the notebook
import matplotlib.pyplot as plt
%matplotlib inline
In [ ]:
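a = pd.read_csv("data/ADMISSIONS.csv")
# lower-case the column names so they match the names used below
a.columns = a.columns.str.lower()
# number of admissions per marital status
a.groupby(['marital_status']).count()['row_id'].plot(kind='pie')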
In [ ]:
a.groupby(['religion']).count()['row_id'].plot(kind = 'barh')
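The two cells above count rows per category with `groupby(...).count()['row_id']`. The sketch below, on a made-up table that only mimics the shape of ADMISSIONS, shows that `value_counts()` gives the same numbers and that rows with a missing category are dropped by default. The values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for ADMISSIONS (values are not from MIMIC)
toy = pd.DataFrame({
    "row_id": [1, 2, 3, 4, 5],
    "marital_status": ["MARRIED", "SINGLE", "MARRIED", np.nan, "SINGLE"],
})

# Pattern used in the cells above: count non-null row_id values per group
by_group = toy.groupby("marital_status").count()["row_id"]

# Equivalent shortcut; sort_index() aligns the ordering with groupby
by_value = toy["marital_status"].value_counts().sort_index()

# Rows with a missing marital_status are excluded by default;
# dropna=False keeps them as their own category
with_missing = toy["marital_status"].value_counts(dropna=False)
```

If the missing category matters for the chart, plotting `with_missing` instead of `by_group` keeps it visible.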
In [ ]:
p = pd.read_csv("data/PATIENTS.csv")
p.columns = p.columns.str.lower()
# attach patient attributes (e.g. gender) to each admission
ap = pd.merge(a, p, on='subject_id', how='inner')
# admissions per religion, split by gender
ap.groupby(['religion','gender']).size().unstack().plot(kind="barh", stacked=True)
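To see what `groupby(...).size().unstack()` hands to the plot, here is a small sketch on made-up tables that only mimic the shape of ADMISSIONS and PATIENTS. All values are invented for illustration.

```python
import pandas as pd

# Made-up stand-ins for ADMISSIONS and PATIENTS (not MIMIC values)
adm = pd.DataFrame({
    "subject_id": [1, 1, 2, 3],
    "religion": ["CATHOLIC", "CATHOLIC", "JEWISH", "CATHOLIC"],
})
pat = pd.DataFrame({"subject_id": [1, 2, 3], "gender": ["F", "M", "M"]})

merged = pd.merge(adm, pat, on="subject_id", how="inner")

# size() counts rows per (religion, gender) pair;
# unstack() moves the inner level (gender) into columns,
# fill_value=0 replaces combinations that never occur
table = merged.groupby(["religion", "gender"]).size().unstack(fill_value=0)
print(table)
```

Each row of `table` becomes one bar and each column one stacked segment. Patient 1 has two admissions here and is counted twice, so the numbers are admissions, not distinct patients.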
In [ ]:
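c = pd.read_csv("data/CPTEVENTS.csv")
c.columns = c.columns.str.lower()
# one admission can have many CPT events, so this join is one-to-many
ac = pd.merge(a, c, on='hadm_id', how='inner')
# CPT events per discharge location, split by CPT section
ac.groupby(['discharge_location','sectionheader']).size().unstack().plot(kind="barh", stacked=True)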
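Because one admission can have several CPT events, the merged table has one row per event, and `size()` counts events rather than admissions. The sketch below illustrates this on made-up data and uses the `validate` argument of `pd.merge` to assert the expected one-to-many relationship. The table contents are invented.

```python
import pandas as pd

# Made-up stand-ins for ADMISSIONS and CPTEVENTS (not MIMIC values)
adm = pd.DataFrame({"hadm_id": [10, 20], "discharge_location": ["HOME", "SNF"]})
cpt = pd.DataFrame({
    "hadm_id": [10, 10, 10, 20],
    "sectionheader": ["Medicine", "Medicine", "Surgery", "Medicine"],
})

# validate raises pandas.errors.MergeError if hadm_id is not unique in `adm`
joined = pd.merge(adm, cpt, on="hadm_id", how="inner", validate="one_to_many")

# Rows per location = number of CPT events
events_per_location = joined.groupby("discharge_location").size()

# Distinct hadm_id per location = number of admissions
admissions_per_location = joined.groupby("discharge_location")["hadm_id"].nunique()
```

Which of the two counts is appropriate depends on the question being asked; the stacked bar chart above reflects the event count.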
Agenda
https://github.com/pandas-profiling/pandas-profiling
Pandas-Profiling is a Python library for exploratory data analysis
In [ ]:
# !conda install -c conda-forge pandas-profiling -y
import pandas_profiling
In [ ]:
a = pd.read_csv("data/ADMISSIONS.csv")
a.columns = a.columns.str.lower()
In [ ]:
# skip the timestamp columns (names ending in 'time'); they add little to the profile
cols = [col for col in a.columns if not col.endswith('time')]
pandas_profiling.ProfileReport(a[cols])
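If the profiling package is not available, a rough per-column overview can be built with plain pandas. This is only a sketch of the kind of information a profile contains, not a replacement for the report. The table below is made up and `overview` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for ADMISSIONS (values are not from MIMIC)
toy = pd.DataFrame({
    "admittime": ["2130-01-01 10:00:00", "2130-02-01 11:00:00", "2130-03-01 12:00:00"],
    "admission_type": ["EMERGENCY", "ELECTIVE", "EMERGENCY"],
    "language": [np.nan, "ENGL", "ENGL"],
})

# Same filter as above: drop columns whose name ends in 'time'
cols = [col for col in toy.columns if not col.endswith("time")]

# One row per remaining column: data type, distinct values, most frequent value
overview = pd.DataFrame({
    "dtype": toy[cols].dtypes.astype(str),
    "n_unique": toy[cols].nunique(),
    "top": toy[cols].mode().iloc[0],
})
print(overview)
```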
Agenda
https://github.com/ResidentMario/missingno
Missingno offers a visual summary of the completeness of a dataset. This example suggests a few observations about the ADMISSIONS table:
- edregtime and edouttime are not filled in for every admission.
- language is now a mandatory field for patients, but it was not always one.
In [ ]:
# !conda install -c conda-forge missingno -y
import missingno as msno
msno.matrix(a)
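The matrix gives a visual impression. To put numbers on it, the share of missing values per column can be computed with plain pandas. The sketch below uses a made-up table whose column names mirror ADMISSIONS; the values are invented.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for ADMISSIONS (values are not from MIMIC)
toy = pd.DataFrame({
    "hadm_id": [1, 2, 3, 4],
    "edregtime": [np.nan, "2130-01-01 08:00:00", np.nan, np.nan],
    "language": [np.nan, np.nan, "ENGL", "ENGL"],
})

# isna() marks missing cells; the column mean is the fraction missing
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the real table, the same expression would be `a.isna().mean()`.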
Agenda
https://github.com/amueller/word_cloud
Wordcloud visualizes a given text in a word-cloud format
This example illustrates that sepsis is among the most frequent diagnoses in the ADMISSIONS table
In [ ]:
# !conda install -c conda-forge wordcloud -y
from wordcloud import WordCloud
In [ ]:
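# join all diagnoses into one string; str() on the array would add brackets and quotes
text = ' '.join(a['diagnosis'].dropna().astype(str))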
In [ ]:
wordcloud = WordCloud().generate(text)
In [ ]:
import matplotlib.pyplot as plt
plt.figure(figsize = (10,10))
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis("off")
plt.show()
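A word cloud shows relative prominence but no numbers. As a rough cross-check, word frequencies can be counted directly. The sketch below uses a few made-up diagnosis strings and a deliberately crude tokenizer. WordCloud applies its own tokenization and stop-word list, so the counts will not match it exactly.

```python
import re
from collections import Counter

import pandas as pd

# Made-up diagnosis strings for illustration (not MIMIC values)
diagnoses = pd.Series(["SEPSIS", "PNEUMONIA;SEPSIS", None, "CONGESTIVE HEART FAILURE"])

# Drop missing entries and join everything into one text
text = " ".join(diagnoses.dropna().astype(str))

# Crude tokenizer: every run of letters is a word
words = re.findall(r"[A-Za-z]+", text.upper())

# Most frequent words with their counts
top_words = Counter(words).most_common(3)
print(top_words)
```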