Data Visualization with Python

Authors: Prof. med. Thomas Ganslandt Thomas.Ganslandt@medma.uni-heidelberg.de
and Kim Hee HeeEun.Kim@medma.uni-heidelberg.de

Heinrich-Lanz-Center for Digital Health (HLZ) of the Medical Faculty Mannheim
Heidelberg University

This is a part of a tutorial prepared for TMF summer school on 03.07.2019

Prerequisite: Access the MIMIC-III Dataset

MIMIC (Medical Information Mart for Intensive Care) is a freely accessible database containing data on Intensive Care Unit (ICU) patients. The demo dataset is limited to 100 patients and is publicly available as CSV files or as a single Postgres database backup file.

Instructions for accessing the MIMIC demo dataset:

  1. Create an account on PhysioNet using the following link: https://physionet.org/register/
  2. Navigate to the project page: https://physionet.org/content/mimiciii-demo/
  3. Read the Data Use Agreement and click “I agree” to access the data

Prerequisite: MIMIC-III Files Stored Locally

You should place the following MIMIC-III data files in the data/ subfolder:

  • ADMISSIONS.csv
  • PATIENTS.csv
  • CPTEVENTS.csv
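
Before running the notebook cells, it can help to confirm that the three files are where the code expects them. The following is a minimal sketch: the helper name `missing_mimic_files` is hypothetical and not part of the original tutorial, and the `data` folder name is taken from the instruction above.

```python
from pathlib import Path

# File names taken from the list above; the helper itself is illustrative.
REQUIRED_FILES = ["ADMISSIONS.csv", "PATIENTS.csv", "CPTEVENTS.csv"]


def missing_mimic_files(data_dir="data"):
    """Return the required MIMIC-III file names that are not present in data_dir."""
    folder = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


missing = missing_mimic_files("data")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required MIMIC-III files found.")
```

An empty list means all later `pd.read_csv("data/...")` calls should find their input.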

Agenda

  • Pandas
  • Pandas-Profiling
  • Missingno
  • Wordcloud

Pandas

http://pandas.pydata.org/pandas-docs/stable/reference/
Pandas is a Python library for exploring, processing, and modeling data.

Pandas can chart a tabular dataset directly through DataFrame.plot:

DataFrame.plot(x=None, y=None, kind='line', ...)

kind :

  • 'line': line plot (default)
  • 'bar': vertical bar plot
  • 'barh': horizontal bar plot
  • 'hist': histogram
  • 'box': boxplot
  • 'kde': Kernel Density Estimation plot
  • 'density': same as 'kde'
  • 'area': stacked area plot
  • 'pie': pie plot
  • 'scatter': scatter plot
  • 'hexbin': Hexagonal binning plot
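
To make the `kind` options above more concrete, here is a small sketch on made-up toy data (the column names `ward`, `admissions`, and `mean_stay_days` are invented for illustration and do not come from MIMIC-III). It draws three of the listed plot kinds side by side.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up toy data, not taken from MIMIC-III.
toy = pd.DataFrame({
    "ward": ["A", "B", "C", "D"],
    "admissions": [12, 7, 3, 9],
    "mean_stay_days": [4.5, 6.0, 2.5, 8.0],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Vertical bars: one bar per ward.
toy.plot(x="ward", y="admissions", kind="bar", ax=axes[0], legend=False)

# Scatter plot: x and y must both be numeric columns.
toy.plot(x="admissions", y="mean_stay_days", kind="scatter", ax=axes[1])

# Histogram of a single column (a Series has the same plot interface).
toy["mean_stay_days"].plot(kind="hist", bins=4, ax=axes[2])

fig.tight_layout()
```

Inside a notebook with `%matplotlib inline` the figure is displayed automatically; in a plain script, a final `plt.show()` would be needed.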

Visualize the admissions table


In [ ]:
import pandas as pd
import matplotlib.pyplot as plt

# show all columns when displaying a DataFrame
pd.set_option('display.max_columns', 999)
# render figures directly in the notebook
%matplotlib inline

In [ ]:
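# load the admissions table and lower-case the column names
a = pd.read_csv("data/ADMISSIONS.csv")
a.columns = a.columns.str.lower()
# pie chart: number of admissions per marital status
a.groupby(['marital_status']).count()['row_id'].plot(kind='pie')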

In [ ]:
# horizontal bar chart: number of admissions per religion
a.groupby(['religion']).count()['row_id'].plot(kind='barh')

In [ ]:
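# join each admission with its patient record to obtain the gender
p = pd.read_csv("data/PATIENTS.csv")
p.columns = p.columns.str.lower()
ap = pd.merge(a, p, on='subject_id', how='inner')
# stacked bars: admissions per religion, split by gender
ap.groupby(['religion', 'gender']).size().unstack().plot(kind="barh", stacked=True)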

In [ ]:
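# join admissions with their CPT (Current Procedural Terminology) events
c = pd.read_csv("data/CPTEVENTS.csv")
c.columns = c.columns.str.lower()
ac = pd.merge(a, c, on='hadm_id', how='inner')
# stacked bars: CPT events per discharge location, split by section header
ac.groupby(['discharge_location', 'sectionheader']).size().unstack().plot(kind="barh", stacked=True)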
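
The stacked bar charts above rely on the `groupby(...).size().unstack()` pattern. The sketch below shows, on a few made-up rows, what table this pattern produces before it is plotted. The values are invented, and `fill_value=0` is an addition not used in the cells above (without it, missing combinations appear as NaN).

```python
import pandas as pd

# Made-up toy data standing in for the merged admissions/patients table.
toy = pd.DataFrame({
    "religion": ["CATHOLIC", "CATHOLIC", "JEWISH", "CATHOLIC", "JEWISH"],
    "gender": ["F", "M", "F", "F", "F"],
})

# size() counts rows per (religion, gender) pair;
# unstack() moves the inner level (gender) into columns.
counts = toy.groupby(["religion", "gender"]).size().unstack(fill_value=0)
print(counts)
```

Each row of `counts` becomes one bar and each column becomes one stacked segment when `plot(kind="barh", stacked=True)` is called on it.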

Pandas-Profiling

https://github.com/pandas-profiling/pandas-profiling
Pandas-Profiling is a Python library for exploratory data analysis. Newer releases of the package are published under the name ydata-profiling.

Import pandas-profiling (1/3)


In [ ]:
# !conda install -c conda-forge pandas-profiling -y
import pandas_profiling

Load the admissions table (2/3)


In [ ]:
a = pd.read_csv("data/ADMISSIONS.csv")
a.columns = a.columns.str.lower()

Profile the table (3/3)


In [ ]:
# ignore the times when profiling since they are uninteresting
cols = [c for c in a.columns if not c.endswith('time')]
pandas_profiling.ProfileReport(a[cols])

Missingno

https://github.com/ResidentMario/missingno
Missingno offers a visual summary of the completeness of a dataset. Applied to the ADMISSIONS table, it suggests the following:

  • Not every patient was admitted through the emergency department, since edregtime and edouttime have many missing values.
  • The patient language is now a mandatory field, but it was not in the past.
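
As a numeric complement to the visual summary, the share of missing values per column can be computed with plain pandas. This is a minimal sketch on made-up rows; only the column names mirror the ADMISSIONS table, the values are invented.

```python
import pandas as pd

# Made-up toy rows; column names mirror the ADMISSIONS table.
toy = pd.DataFrame({
    "hadm_id": [1, 2, 3, 4],
    "edregtime": ["2130-01-01 10:00", None, None, "2131-05-02 08:30"],
    "language": [None, None, "ENGL", "ENGL"],
})

# isna() marks missing cells; the column-wise mean is the missing fraction.
missing_fraction = toy.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```

The same expression applied to the real table (`a.isna().mean()`) should reflect the gaps that the missingno matrix shows as white space.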

In [ ]:
# !conda install -c conda-forge missingno -y
import missingno as msno
msno.matrix(a)

Wordcloud

https://github.com/amueller/word_cloud
Wordcloud visualizes a given text in a word-cloud format, where more frequent words are drawn larger.
This example illustrates that sepsis is the most prominent diagnosis among the patients.

Import the Wordcloud package (1/4)


In [ ]:
# !conda install -c conda-forge wordcloud -y
from wordcloud import WordCloud

Prepare an input text in string (2/4)


In [ ]:
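# join all diagnoses into one string; str() on the array would keep brackets and quotes
text = ' '.join(a['diagnosis'].dropna().astype(str))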

Generate a word-cloud from the input text (3/4)


In [ ]:
wordcloud = WordCloud().generate(text)

Plot the word-cloud (4/4)


In [ ]:
import matplotlib.pyplot as plt
plt.figure(figsize = (10,10))
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis("off")
plt.show()
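
To cross-check what the word cloud shows, the word frequencies can also be counted directly. The sketch below uses a few made-up diagnosis strings and a simple letter-based tokenizer; this is an assumption for illustration and is not identical to WordCloud's internal processing, which additionally removes stopwords and can group word pairs.

```python
from collections import Counter
import re

# Made-up diagnosis strings in the style of the diagnosis column.
diagnoses = ["SEPSIS", "PNEUMONIA;SEPSIS", "CONGESTIVE HEART FAILURE", "SEPSIS"]

# Lower-case the text, extract runs of letters as tokens, and count them.
tokens = re.findall(r"[a-z]+", " ".join(diagnoses).lower())
word_counts = Counter(tokens)

print(word_counts.most_common(3))
```

With the real data, `diagnoses` could be replaced by `a['diagnosis'].dropna()` to list the terms that dominate the cloud.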

Questions?
