hospital

The hospital table is a very simple table that summarizes high level hospital information. Unlike other tables, it does not contain any patient identifiers, and instead can only be joined to the patient table using hospitalid.

Note: many hospitals described in the hospital table have 0 patient admissions in the patient table.



In [1]:

    
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import psycopg2
import getpass
import pdvega

# for configuring connection 
from configobj import ConfigObj
import os

%matplotlib inline









    



/home/alistairewj/.local/lib/python3.5/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
  """)



In [2]:

    
# Create a database connection using settings from config file
config='../db/config.ini'

# connection info
conn_info = dict()
if os.path.isfile(config):
    config = ConfigObj(config)
    conn_info["sqluser"] = config['username']
    conn_info["sqlpass"] = config['password']
    conn_info["sqlhost"] = config['host']
    conn_info["sqlport"] = config['port']
    conn_info["dbname"] = config['dbname']
    conn_info["schema_name"] = config['schema_name']
else:
    conn_info["sqluser"] = 'postgres'
    conn_info["sqlpass"] = ''
    conn_info["sqlhost"] = 'localhost'
    conn_info["sqlport"] = 5432
    conn_info["dbname"] = 'eicu'
    conn_info["schema_name"] = 'public,eicu_crd'
    
# Connect to the eICU database
print('Database: {}'.format(conn_info['dbname']))
print('Username: {}'.format(conn_info["sqluser"]))
if conn_info["sqlpass"] == '':
    # try connecting without password, i.e. peer or OS authentication
    try:
        if (conn_info["sqlhost"] == 'localhost') & (conn_info["sqlport"]=='5432'):
            con = psycopg2.connect(dbname=conn_info["dbname"],
                                   user=conn_info["sqluser"])            
        else:
            con = psycopg2.connect(dbname=conn_info["dbname"],
                                   host=conn_info["sqlhost"],
                                   port=conn_info["sqlport"],
                                   user=conn_info["sqluser"])
    except:
        conn_info["sqlpass"] = getpass.getpass('Password: ')

        con = psycopg2.connect(dbname=conn_info["dbname"],
                               host=conn_info["sqlhost"],
                               port=conn_info["sqlport"],
                               user=conn_info["sqluser"],
                               password=conn_info["sqlpass"])
query_schema = 'set search_path to ' + conn_info['schema_name'] + ';'









    



Database: eicu
Username: alistairewj

Compare data completion for hospitals with and without admissions



In [10]:

    
query = query_schema + """
with tt as
(
select hospitalid, count(*) as n
from patient
group by hospitalid
)
select h.*, coalesce(tt.n, 0) as n
from hospital h
left join tt
on h.hospitalid = tt.hospitalid
"""

df = pd.read_sql_query(query, con)
df.head()









    Out[10]:







  
    
      
      hospitalid
      numbedscategory
      teachingstatus
      region
      n
    
  
  
    
      0
      195
      250-500
      True
      South
      1174
    
    
      1
      251
      <100
      False
      Midwest
      318
    
    
      2
      402
      <100
      False
      West
      405
    
    
      3
      120
      <100
      False
      South
      111
    
    
      4
      171
      250-500
      False
      West
      2807



In [11]:

    
print('{} hospitals have 0 admissions.'.format(df.loc[df['n']==0,'hospitalid'].nunique()))









    



251 hospitals have 0 admissions

Data completion among hospitals with admissions

First impute 'missing' so that our groupby reports on the number of hospitals with missing data. Also, we define a convenience function for reporting the absolute count and percent of the total.



In [23]:

    
def count_with_percent(x, N):
    return '{:3d} ({:5.2f}%)'.format(x.count(), x.count()*100.0/N)



In [25]:

    
for c in ['region','numbedscategory','teachingstatus','region']:
    df[c].fillna('Missing',inplace=True)



In [29]:

    
idx = df['n']!=0
N = np.sum(idx)

for c in ['region','numbedscategory','teachingstatus']:
    grp = df.loc[idx, :].groupby(c)['hospitalid']
    
    print('')
    print(grp.apply(count_with_percent, N))









    



region
Midwest       70 (33.65%)
Missing       25 (12.02%)
Northeast     15 ( 7.21%)
South         56 (26.92%)
West          42 (20.19%)
Name: hospitalid, dtype: object

numbedscategory
100-249     66 (31.73%)
250-500     32 (15.38%)
<100        50 (24.04%)
>500        24 (11.54%)
Missing     36 (17.31%)
Name: hospitalid, dtype: object

teachingstatus
False      168 (80.77%)
True        17 ( 8.17%)
Missing     23 (11.06%)
Name: hospitalid, dtype: object

Above we can see that the frequency of missing data ranges between 11-18% in the hospital table.

	hospitalid	numbedscategory	teachingstatus	region	n
0	195	250-500	True	South	1174
1	251	<100	False	Midwest	318
2	402	<100	False	West	405
3	120	<100	False	South	111
4	171	250-500	False	West	2807