Download, Parse and Interrogate Apple Health Export Data

The first part of this program is all about getting the Apple Health export and putting it into an analyzable format. At that point it can be analyzed anywhere. The second part of this program uses the SAS Scripting Wrapper for Analytics Transfer (SWAT) Python library to transfer the data to SAS Viya and analyze it there. The SWAT package provides native Python access to SAS Viya.

https://github.com/sassoftware/python-swat

This file was created from a desire to get my hands on data collected by Apple Health, notably heart rate information collected by Apple Watch. For this to work, the export file needs to be in a location accessible to Python. A little bit of searching told me that iCloud file access is problematic, and that there were already a number of ways of doing this with the Google API if the file was saved to Google Drive. I chose PyDrive. So for the end-to-end program to work with little user intervention, you will need to sign up for Google Drive, set up an application in the Google API console, and install the Google Drive app on your iPhone.

This may sound involved, but it is not necessary if you simply email the export file to yourself and copy it to a filesystem that Python can see. If you choose to do that, all of the Google Drive portion can be removed. I like the Google Drive process, though, as it keeps manual work to a minimum.

This version requires the user to grant Google access, which takes a few additional clicks, but not too many. I think it is possible to automate this to run without user intervention as well, using saved credential files.

The first step to enabling this process is exporting the data from Apple Health. As of this writing, open Apple Health and tap your user icon or photo. Near the bottom of the next page in the app is a button or link called Export Health Data. Tapping this generates a zipped XML file. The next dialog asks where you want to save it; options include email, save to iCloud, Message, etc. Select Google Drive. Google Drive allows multiple files with the same name, and this program accounts for that.


In [59]:
import xml.etree.ElementTree as et
import pandas as pd
import numpy as np
from datetime import datetime, date
import matplotlib.pyplot as plt
import re 
import os.path
import zipfile


%matplotlib inline
plt.rcParams['figure.figsize'] = 16, 8

Authenticate with Google

This will open a browser to let you begin the process of authenticating with an existing Google Drive account. This process is separate from Python. For this to work, you will need to set up an "Other" OAuth credential at https://console.developers.google.com/apis/credentials, save the client secrets file in your root directory, and follow a few other steps that are detailed at https://pythonhosted.org/PyDrive/. The PyDrive instructions also show you how to set up your Google application. There are other methods for accessing the Google API from Python, but this one seems pretty nice. The first time through the process, a regular sign-in (and two-factor authentication, if you have it enabled) is required; after that it is just a matter of telling Google that it is OK for your application to access Drive.
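The PyDrive documentation describes a settings.yaml file that can cache credentials so subsequent runs skip the browser step entirely. A minimal sketch, assuming the OAuth client secrets file is named client_secrets.json (the saved-credentials file name here is an arbitrary choice):

```yaml
# settings.yaml — read automatically by GoogleAuth()
client_config_backend: file
client_config_file: client_secrets.json
save_credentials: True
save_credentials_backend: file
save_credentials_file: saved_credentials.json
get_refresh_token: True
```

With this in place, gauth.LocalWebserverAuth() should only prompt the first time; afterwards the refresh token stored in saved_credentials.json is reused.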


In [60]:
# Authenticate into Google Drive
from pydrive.auth import GoogleAuth

gauth = GoogleAuth()
gauth.LocalWebserverAuth()

Download the most recent Apple Health export file

Now that we are authenticated with Google Drive, use PyDrive to access the API and list the stored files.

Google Drive allows multiple files with the same name, distinguishing them by file ID. In this block, we make one pass over the file list, match file names of the form export.zip (or a numbered variant like export-11.zip), and keep the row with the most recent creation date. We will use that file ID later to download the correct file. Apple Health names the export file export.zip, and at the time this was written, there is no other option.
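The selection rule above can be exercised offline on toy metadata shaped like the PyDrive listing (the titles, IDs, and dates below are invented for illustration):

```python
import re
from datetime import datetime

# Toy stand-ins for PyDrive file-list entries (invented values)
files = [
    {'title': 'export.zip',   'id': 'aaa', 'createdDate': '2017-06-26T10:57:23.957Z'},
    {'title': 'export-2.zip', 'id': 'bbb', 'createdDate': '2017-07-01T09:00:00.000Z'},
    {'title': 'notes.txt',    'id': 'ccc', 'createdDate': '2017-08-01T09:00:00.000Z'},
]

selection_id, selection_dt = None, datetime.min
for f in files:
    # export.zip plus numbered variants such as export-2.zip
    if re.search(r"^export-?\d*\.zip", f['title']):
        dt = datetime.strptime(f['createdDate'], "%Y-%m-%dT%H:%M:%S.%fZ")
        if dt > selection_dt:
            selection_id, selection_dt = f['id'], dt

print(selection_id)  # bbb — the newest matching export
```

Note that notes.txt is ignored even though it is the newest file: only names matching the export pattern are considered.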


In [61]:
from pydrive.drive import GoogleDrive
drive = GoogleDrive(gauth)

file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()

# Step through the file list, find the most current export file's id, and use
#     it later to download that file to the local machine.
# This may look a little old school, but these file lists will never be massive,
#     and it is a readable, one-pass way to get the most current file with
#     few resources.
selection_dt = datetime.min  # sentinel older than any real export
print("Matching Files")
for file1 in file_list: 
    if re.search(r"^export-?\d*\.zip", file1['title']):
        dt = datetime.strptime(file1['createdDate'],"%Y-%m-%dT%H:%M:%S.%fZ")
        if dt > selection_dt:
            selection_id = file1['id']
            selection_dt = dt
        print('    title: %s, id: %s createDate: %s' % (file1['title'], file1['id'], file1['createdDate']))


Matching Files
    title: export-11.zip, id: 0B_EXRCwLorf3UG1SanhScVZ1TVE createDate: 2017-08-12T11:19:15.692Z
    title: export-11.zip, id: 0B_EXRCwLorf3emYyM2Y5RmFnZ00 createDate: 2017-08-11T10:21:51.534Z
    title: export-11.zip, id: 0B_EXRCwLorf3cmZlYWhXUVZMOHM createDate: 2017-08-10T11:31:15.158Z
    title: export-11.zip, id: 0B_EXRCwLorf3TGNuZC1JQWE2UWs createDate: 2017-08-07T18:24:40.990Z
    title: export-11.zip, id: 0B_EXRCwLorf3U0NLTXluUDlmUTA createDate: 2017-08-07T14:18:22.915Z
    title: export-11.zip, id: 0B_EXRCwLorf3TVF4Zkt5X0V5MVE createDate: 2017-08-06T11:45:50.691Z
    title: export-11.zip, id: 0B_EXRCwLorf3REFCSG9XQ04wWDA createDate: 2017-08-05T11:30:20.282Z
    title: export-11.zip, id: 0B_EXRCwLorf3UXNMaUFrcDh3NmM createDate: 2017-08-04T09:53:55.749Z
    title: export-11.zip, id: 0B_EXRCwLorf3YWRkaGRGU0t1UmM createDate: 2017-08-03T11:04:26.398Z
    title: export-11.zip, id: 0B_EXRCwLorf3b2xtcm9CaG1BUGs createDate: 2017-08-02T12:56:31.144Z
    title: export-11.zip, id: 0B_EXRCwLorf3ZURpVWY0ZGdod2s createDate: 2017-08-01T10:38:44.404Z
    title: export-11.zip, id: 0B_EXRCwLorf3NzlPc1ByZzBNM1E createDate: 2017-07-31T12:12:53.571Z
    title: export-11.zip, id: 0B_EXRCwLorf3aWlxdlRHR0ZQUUU createDate: 2017-07-30T11:16:51.081Z
    title: export-11.zip, id: 0B_EXRCwLorf3Vlc0OHhBa2JkYkk createDate: 2017-07-29T10:53:01.569Z
    title: export-11.zip, id: 0B_EXRCwLorf3bk5UdzU2M184aWc createDate: 2017-07-28T18:02:15.345Z
    title: export-11.zip, id: 0B_EXRCwLorf3Z3hERGd3SFZVaFk createDate: 2017-07-27T18:51:01.097Z
    title: export-11.zip, id: 0B_EXRCwLorf3V095MTdoWTVLbTg createDate: 2017-07-26T12:30:12.925Z
    title: export-11.zip, id: 0B_EXRCwLorf3eWdpSlY3VU1IZWs createDate: 2017-07-25T11:21:26.000Z
    title: export-11.zip, id: 0B_EXRCwLorf3MVlrQllTMXB2S1k createDate: 2017-07-24T09:52:26.366Z
    title: export-11.zip, id: 0B_EXRCwLorf3M0t0S3lwWVAzTTA createDate: 2017-07-23T12:16:24.149Z
    title: export-11.zip, id: 0B_EXRCwLorf3SDl2N0pHeUtnTm8 createDate: 2017-07-22T10:31:00.528Z
    title: export-11.zip, id: 0B_EXRCwLorf3MEk1Zno0UURCSDg createDate: 2017-07-21T11:42:59.965Z
    title: export-10.zip, id: 0B_EXRCwLorf3Y0p4MGpTSEoyQlk createDate: 2017-07-20T10:07:03.464Z
    title: export-10.zip, id: 0B_EXRCwLorf3WnJ0cFNtSDJFRGM createDate: 2017-07-19T20:53:43.324Z
    title: export-9.zip, id: 0B_EXRCwLorf3elh0clFKbmM2dlU createDate: 2017-07-19T11:39:25.974Z
    title: export-9.zip, id: 0B_EXRCwLorf3aC1NREhuUFJhbTA createDate: 2017-07-18T12:02:42.631Z
    title: export-8.zip, id: 0B_EXRCwLorf3aFppa0hla1BCTVk createDate: 2017-07-17T18:04:02.804Z
    title: export-7.zip, id: 0B_EXRCwLorf3bzNEdV84NmZxUFE createDate: 2017-07-16T11:11:39.658Z
    title: export-7.zip, id: 0B_EXRCwLorf3eThvMFI4Nk96Nmc createDate: 2017-07-15T21:19:23.211Z
    title: export-6.zip, id: 0B_EXRCwLorf3dmNVV052eDVLNWs createDate: 2017-07-14T20:27:54.409Z
    title: export-6.zip, id: 0B_EXRCwLorf3eTNicTZ4OXkxOUE createDate: 2017-07-14T11:34:00.858Z
    title: export-6.zip, id: 0B_EXRCwLorf3MVNzaklpQjNlRW8 createDate: 2017-07-13T11:17:55.912Z
    title: export-5.zip, id: 0B_EXRCwLorf3UDdOdS1FVDljMlE createDate: 2017-07-12T09:26:12.919Z
    title: export-4.zip, id: 0B_EXRCwLorf3eHVlX3FzN1BrMWc createDate: 2017-07-10T12:07:32.447Z
    title: export-3.zip, id: 0B_EXRCwLorf3WUhOcE1mZzhZTHc createDate: 2017-06-30T10:01:11.615Z
    title: export.zip, id: 0B_EXRCwLorf3aGNVdlNWTWRrdm8 createDate: 2017-06-26T10:57:23.957Z

In [62]:
if not os.path.exists('healthextract'):
    os.mkdir('healthextract')

Download the file from Google Drive

Ensure that the file downloaded is the latest file generated


In [63]:
for file1 in file_list:
    if file1['id'] == selection_id:
        print('Downloading this file: %s, id: %s createDate: %s' % (file1['title'], file1['id'], file1['createdDate']))
        file1.GetContentFile("healthextract/export.zip")


Downloading this file: export-11.zip, id: 0B_EXRCwLorf3UG1SanhScVZ1TVE createDate: 2017-08-12T11:19:15.692Z

Unzip the most current file to a holding directory


In [64]:
with zipfile.ZipFile('healthextract/export.zip', 'r') as zip_ref:
    zip_ref.extractall('healthextract')

Parse Apple Health Export document


In [65]:
path = "healthextract/apple_health_export/export.xml"
e = et.parse(path)
#this was from an older iPhone, to demonstrate how to join files
legacy = et.parse("healthextract/apple_health_legacy/export.xml")

In [66]:
#<<TODO: Automate this process

#legacyFilePath = "healthextract/apple_health_legacy/export.xml"
#if os.path.exists(legacyFilePath):
#    legacy = et.parse("healthextract/apple_health_legacy/export.xml")
#else:
#    os.mkdir('healthextract/apple_health_legacy')

List XML headers by element count


In [67]:
pd.Series([el.tag for el in e.iter()]).value_counts()


Out[67]:
Record             364995
ActivitySummary       242
MetadataEntry         160
Workout                15
Correlation             7
WorkoutEvent            2
ExportDate              1
Me                      1
HealthData              1
dtype: int64

List types for "Record" Header


In [68]:
pd.Series([atype.get('type') for atype in e.findall('Record')]).value_counts()


Out[68]:
HKQuantityTypeIdentifierActiveEnergyBurned           137025
HKQuantityTypeIdentifierBasalEnergyBurned             79875
HKQuantityTypeIdentifierHeartRate                     73792
HKQuantityTypeIdentifierDistanceWalkingRunning        38385
HKQuantityTypeIdentifierStepCount                     29629
HKCategoryTypeIdentifierAppleStandHour                 2960
HKQuantityTypeIdentifierAppleExerciseTime              2091
HKQuantityTypeIdentifierFlightsClimbed                 1096
HKQuantityTypeIdentifierBodyMass                         70
HKQuantityTypeIdentifierBodyTemperature                  11
HKQuantityTypeIdentifierBloodPressureSystolic             7
HKQuantityTypeIdentifierBloodPressureDiastolic            7
HKQuantityTypeIdentifierDietaryCholesterol                2
HKCategoryTypeIdentifierMindfulSession                    2
HKQuantityTypeIdentifierDietaryFatTotal                   2
HKQuantityTypeIdentifierDietaryProtein                    2
HKQuantityTypeIdentifierDietaryCalcium                    2
HKQuantityTypeIdentifierDietaryFatMonounsaturated         2
HKQuantityTypeIdentifierDietaryCarbohydrates              2
HKQuantityTypeIdentifierDietaryFiber                      2
HKQuantityTypeIdentifierDietaryFatSaturated               2
HKQuantityTypeIdentifierDietaryEnergyConsumed             2
HKQuantityTypeIdentifierDietaryPotassium                  2
HKQuantityTypeIdentifierDietarySodium                     2
HKQuantityTypeIdentifierDietaryVitaminC                   2
HKQuantityTypeIdentifierDietarySugar                      2
HKQuantityTypeIdentifierDietaryFatPolyunsaturated         2
HKQuantityTypeIdentifierDietaryIron                       2
HKQuantityTypeIdentifierHeight                            1
dtype: int64

Extract Values to Data Frame

TODO: Abstraction of the next code block


In [69]:
# Extract the heart rate values and timestamps from the XML.
# There is likely a more efficient way, though this is very fast.
def xmltodf(eltree, element,outvaluename):
    dt = []
    v = []
    for atype in eltree.findall('Record'):
        if atype.get('type') == element:
            dt.append(datetime.strptime(atype.get("startDate"),"%Y-%m-%d %H:%M:%S %z"))
            v.append(atype.get("value"))

    myd = pd.DataFrame({"Create":dt,outvaluename:v})
    colDict = {"Month":"%Y-%m", "Week":"%Y-%U","Day":"%d","Hour":"%H","Days":"%Y-%m-%d"}
    for col, fmt in colDict.items():
        myd[col] = myd['Create'].apply(lambda x: x.strftime(fmt))


    myd[outvaluename] = myd[outvaluename].astype(float).astype(int)
    print('Extracting ' + outvaluename + ', type: ' + element)
  
    return(myd)

HR_df = xmltodf(e,"HKQuantityTypeIdentifierHeartRate","HeartRate")


Extracting HeartRate, type: HKQuantityTypeIdentifierHeartRate
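For reference, the attribute access used inside xmltodf can be sketched against a minimal hand-written Record element (the values here are invented):

```python
import xml.etree.ElementTree as et
from datetime import datetime

# One fake Record, shaped like the entries in export.xml
snippet = ('<HealthData><Record type="HKQuantityTypeIdentifierHeartRate" '
           'value="62" startDate="2017-08-01 07:15:02 -0500"/></HealthData>')
rec = et.fromstring(snippet).find('Record')

# Same conversions as in xmltodf: parse the timestamp, coerce the value
dt = datetime.strptime(rec.get('startDate'), "%Y-%m-%d %H:%M:%S %z")
print(dt.strftime('%Y-%m'), int(float(rec.get('value'))))  # 2017-08 62
```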

In [70]:
#comment this cell out if no legacy exports.
# extract legacy data, create series for heartrate to join with newer data
HR_df_leg = xmltodf(legacy,"HKQuantityTypeIdentifierHeartRate","HeartRate")
HR_df = pd.concat([HR_df_leg,HR_df])


Extracting HeartRate, type: HKQuantityTypeIdentifierHeartRate

In [71]:
#reset plot - just for tinkering 
plt.rcParams['figure.figsize'] = 16, 8

In [72]:
HR_df.boxplot(by='Month',column="HeartRate", return_type='axes')
plt.grid(axis='x')
plt.title('All Months')
plt.ylabel('Heart Rate')
plt.ylim(40,140)


Out[72]:
(40, 140)

In [73]:
dx = HR_df.boxplot(by='Week',column="HeartRate", return_type='axes')
plt.title('All Weeks')
plt.ylabel('Heart Rate')
plt.xticks(rotation=90)
plt.grid(axis='x')
for _x in [35, 39, 44, 47, 50]:
    plt.axvline(_x, linewidth=1, color='blue')
plt.ylim(40,140)


Out[73]:
(40, 140)

In [74]:
monthval = '2017-08' 
HR_df[HR_df['Month']==monthval].boxplot(by='Day',column="HeartRate", return_type='axes')
plt.grid(axis='x')
plt.rcParams['figure.figsize'] = 16, 8
plt.title('Daily for Month: '+ monthval)
plt.ylabel('Heart Rate')
plt.ylim(40,140)


Out[74]:
(40, 140)

In [75]:
HR_df[HR_df['Month']==monthval].boxplot(by='Hour',column="HeartRate")
plt.title('Hourly for Month: '+ monthval)
plt.ylabel('Heart Rate')
plt.grid(axis='x')
plt.ylim(40,140)


Out[75]:
(40, 140)

In [76]:
import calmap
ts = pd.Series(HR_df['HeartRate'].values, index=HR_df['Days'])
ts.index = pd.to_datetime(ts.index)
tstot = ts.groupby(ts.index).median()

plt.rcParams['figure.figsize'] = 16, 8
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
calmap.yearplot(data=tstot,year=2017)


Out[76]:
<matplotlib.axes._subplots.AxesSubplot at 0x144191f98>

Flag Chemotherapy Days for specific analysis

The next two cells provide the ability to introduce cycles that start on specific days and include this data in the datasets so that they can be overlaid in graphics. In the example below, there are three cycles of 21 days. The getDelta function returns the cycle number when ttp == 0 and the days since that cycle's day 0 otherwise (here it is called with ttp == 1). This allows the cycles to be overlaid, aligned on days since day 0.


In [77]:
# This isn't efficient yet, just a first pass. It functions as intended.
def getDelta(res, ttp, cyclelength):
    # Offsets outside [0, cyclelength) get a 999 sentinel
    mz = [x if 0 <= x < cyclelength else 999 for x in res]
    if ttp == 0:
        return mz.index(min(mz)) + 1      # 1-based cycle number
    else:
        return mz[mz.index(min(mz))]      # days since that cycle's day 0

chemodays = np.array([date(2017,4,24),date(2017,5,16),date(2017,6,6)])

HR_df = xmltodf(e,"HKQuantityTypeIdentifierHeartRate","HeartRate")
# Not the most efficient approach yet; a vectorized version would be faster
a = HR_df['Create'].apply(lambda x: [d.days for d in x.date() - chemodays])
HR_df['ChemoCycle'] = a.apply(lambda x: getDelta(x,0,21))
HR_df['ChemoDays'] = a.apply(lambda x: getDelta(x,1,21))


Extracting HeartRate, type: HKQuantityTypeIdentifierHeartRate
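As a quick check of the cycle arithmetic, getDelta can be exercised on a hand-picked date (the function is restated here so the example stands alone; the sample date is invented):

```python
from datetime import date

def getDelta(res, ttp, cyclelength):
    # Offsets outside [0, cyclelength) get a 999 sentinel
    mz = [x if 0 <= x < cyclelength else 999 for x in res]
    if ttp == 0:
        return mz.index(min(mz)) + 1      # 1-based cycle number
    return mz[mz.index(min(mz))]          # days since that cycle's day 0

chemodays = [date(2017, 4, 24), date(2017, 5, 16), date(2017, 6, 6)]
sample = date(2017, 5, 20)                        # 4 days into cycle 2
offsets = [(sample - c).days for c in chemodays]  # [26, 4, -17]
print(getDelta(offsets, 0, 21), getDelta(offsets, 1, 21))  # 2 4
```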

In [78]:
import seaborn as sns
plotx = HR_df[HR_df['ChemoDays']<=21]
plt.rcParams['figure.figsize'] = 24, 8
ax = sns.boxplot(x="ChemoDays", y="HeartRate", hue="ChemoCycle", data=plotx, palette="Set2",notch=1,whis=0,width=0.75,showfliers=False)
plt.ylim(65,130)
# groupby() moves ChemoDays into the DataFrame index
plotx_med = plotx.groupby('ChemoDays').median()
# restore ChemoDays as a regular column so it can serve as the x axis below
plotx_med.reset_index(inplace=True)

snsplot = sns.pointplot(x='ChemoDays', y="HeartRate", data=plotx_med,color='Gray')


Boxplots Using Seaborn


In [79]:
import seaborn as sns
sns.set(style="ticks", palette="muted", color_codes=True)

sns.boxplot(x="Month", y="HeartRate", data=HR_df,whis=np.inf, color="c")
# Add in points to show each observation
snsplot = sns.stripplot(x="Month", y="HeartRate", data=HR_df,jitter=True, size=1, alpha=.15, color=".3", linewidth=0)



In [ ]: