The first part of this program gets the Apple Health export and puts it into an analyzable format. At that point it can be analyzed anywhere. The second part uses the SAS Scripting Wrapper for Analytics Transfer (SWAT) Python library to transfer the data to SAS Viya and analyze it there. The SWAT package provides native Python language access to the SAS Viya codebase.
This file was created from a desire to get my hands on data collected by Apple Health, notably heart rate information collected by Apple Watch. For this to work, the export file needs to be in a location accessible to Python code. A little searching told me that iCloud file access is problematic, and that there were already a number of ways of doing this with the Google API if the file was saved to Google Drive. I chose PyDrive. So for the end-to-end program to work with little user intervention, you will need to sign up for Google Drive, set up an application in the Google API, and install the Google Drive app on your iPhone.
This may sound involved, and it is not necessary if you simply email the export file to yourself and copy it to a filesystem that Python can see. If you choose to do that, all of the Google Drive portion can be removed. I like the Google Drive process though, as it keeps the manual work to a minimum.
This version requires the user to grant Google access, which takes a few additional clicks. I think it is also possible to automate this to run without user intervention by using security files.
The first step in enabling this process is exporting the data from Apple Health. As of this writing, open Apple Health and tap your user icon or photo. Near the bottom of the next page in the app is a button or link called Export Health Data. Tapping it generates a zipped XML file. The next dialog asks where you want to save it. Options include email, iCloud, messages, and so on. Select Google Drive. Google Drive allows multiple files with the same name, and this program accounts for that.
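If you take the email route, the download step reduces to unpacking a local archive. The following is a minimal sketch and not part of the original notebook: `extract_export` is a hypothetical helper, and the paths you pass to it are assumptions about where you saved the file.

```python
import os
import zipfile


def extract_export(zip_path, dest_dir="healthextract"):
    """Unpack a locally saved Apple Health export.

    Returns the paths of any extracted members named export.xml.
    """
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(dest_dir)
        names = zf.namelist()
    # Keep only the archive members whose file name is export.xml
    return [
        os.path.join(dest_dir, name)
        for name in names
        if os.path.basename(name) == "export.xml"
    ]
```

Calling `extract_export("export.zip")` on a copied export should leave the XML under `healthextract`, the same place the Google Drive version of this notebook puts it.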
In [58]:
import xml.etree.ElementTree as et
import pandas as pd
import numpy as np
from datetime import *
import matplotlib.pyplot as plt
import re
import os.path
import zipfile
import pytz
%matplotlib inline
plt.rcParams['figure.figsize'] = 16, 8
The next cell opens a browser to let you begin the process of authenticating with an existing Google Drive account. This process is separate from Python. For this to work, you will need to set up an OAuth credential of type Other at https://console.developers.google.com/apis/credentials, save the secret file in your root directory, and complete a few other steps that are detailed at https://pythonhosted.org/PyDrive/. The PyDrive instructions also show you how to set up your Google application. There are other methods for accessing the Google API from Python, but this one seems pretty nice. The first time through the process, regular sign-in and two-factor authentication are required (if you use two-factor auth), but after that it is just a matter of telling Google that it is OK for your Google application to access Drive.
In [59]:
# Authenticate into Google Drive
from pydrive.auth import GoogleAuth
gauth = GoogleAuth()
gauth.LocalWebserverAuth()
Now that we are authenticated into Google Drive, use PyDrive to access the API and get to files stored.
Google Drive allows multiple files with the same name, but it indexes them by ID to keep them separate. In this block, we make one pass over the file list, looking for files named export.zip, and keep the ID of the one with the most recent creation date. We use that file ID later to download the correct file. Apple Health names the export file export.zip, and at the time this was written there was no other option.
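The selection logic can be tried without touching Google Drive at all. Below is a small sketch under stated assumptions: `sample_files` is made-up metadata shaped like the dicts PyDrive returns (only `title`, `id`, and `createdDate` are used), the titles and IDs are invented, and `newest_export` is a hypothetical helper rather than anything PyDrive provides.

```python
import re
from datetime import datetime

# Made-up metadata rows, for illustration only
sample_files = [
    {"title": "export.zip", "id": "id-a", "createdDate": "2019-01-05T10:00:00.000Z"},
    {"title": "export.zip", "id": "id-b", "createdDate": "2019-03-20T08:30:00.000Z"},
    {"title": "notes.txt", "id": "id-c", "createdDate": "2019-04-01T09:00:00.000Z"},
]

EXPORT_NAME = re.compile(r"^export-*\d*\.zip")
DATE_FMT = "%Y-%m-%dT%H:%M:%S.%fZ"


def newest_export(files):
    """Return the metadata dict of the most recently created export, or None."""
    matches = [f for f in files if EXPORT_NAME.search(f["title"])]
    if not matches:
        return None
    # Compare parsed creation timestamps rather than raw strings
    return max(matches, key=lambda f: datetime.strptime(f["createdDate"], DATE_FMT))


latest = newest_export(sample_files)
```

Returning `None` when nothing matches makes the "no export found" case explicit, so the download step can check for it before using the ID.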
In [60]:
from pydrive.drive import GoogleDrive
drive = GoogleDrive(gauth)
file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()
# Step through the file list and find the most current export.zip file id, then use
# that later to download the file to the local machine.
# This may look a little old school, but these file lists will never be massive, and
# it is a readable, easy, one-pass way to get the most current file using
# few resources.
selection_dt = datetime.strptime("2000-01-01T01:01:01.001Z","%Y-%m-%dT%H:%M:%S.%fZ")
print("Matching Files")
for file1 in file_list:
    # raw string with an escaped dot, so "." only matches a literal period
    if re.search(r"^export-*\d*\.zip",file1['title']):
        dt = datetime.strptime(file1['createdDate'],"%Y-%m-%dT%H:%M:%S.%fZ")
        if dt > selection_dt:
            selection_id = file1['id']
            selection_dt = dt
        print('    title: %s, id: %s createDate: %s' % (file1['title'], file1['id'], file1['createdDate']))
In [61]:
if not os.path.exists('healthextract'):
    os.mkdir('healthextract')
In [62]:
for file1 in file_list:
    if file1['id'] == selection_id:
        print('Downloading this file: %s, id: %s createDate: %s' % (file1['title'], file1['id'], file1['createdDate']))
        file1.GetContentFile("healthextract/export.zip")
In [63]:
zip_ref = zipfile.ZipFile('healthextract/export.zip', 'r')
zip_ref.extractall('healthextract')
zip_ref.close()
In [64]:
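path = "healthextract/apple_health_export/export.xml"
e = et.parse(path)
#this was from an older iPhone, to demonstrate how to join files
#only parse it when it exists, so this cell still runs without a legacy export
legacy_path = "healthextract/apple_health_legacy/export.xml"
legacy = et.parse(legacy_path) if os.path.exists(legacy_path) else None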
In [65]:
#<<TODO: Automate this process
#legacyFilePath = "healthextract/apple_health_legacy/export.xml"
#if os.path.exists(legacyFilePath):
# legacy = et.parse("healthextract/apple_health_legacy/export.xml")
#else:
# os.mkdir('healthextract/apple_health_legacy')
In [66]:
pd.Series([el.tag for el in e.iter()]).value_counts()
Out[66]:
In [67]:
pd.Series([atype.get('type') for atype in e.findall('Record')]).value_counts()
Out[67]:
In [68]:
import pytz
#Extract the heartrate values, and get a timestamp from the xml
# there is likely a more efficient way, though this is very fast
def txloc(xdate,fmt):
    eastern = pytz.timezone('US/Eastern')
    dte = xdate.astimezone(eastern)
    return datetime.strftime(dte,fmt)

def xmltodf(eltree, element,outvaluename):
    dt = []
    v = []
    for atype in eltree.findall('Record'):
        if atype.get('type') == element:
            dt.append(datetime.strptime(atype.get("startDate"),"%Y-%m-%d %H:%M:%S %z"))
            v.append(atype.get("value"))
    myd = pd.DataFrame({"Create":dt,outvaluename:v})
    colDict = {"Year":"%Y","Month":"%Y-%m", "Week":"%Y-%U","Day":"%d","Hour":"%H","Days":"%Y-%m-%d","Month-Day":"%m-%d"}
    for col, fmt in colDict.items():
        myd[col] = myd['Create'].dt.tz_convert('US/Eastern').dt.strftime(fmt)
    myd[outvaluename] = myd[outvaluename].astype(float).astype(int)
    print('Extracting ' + outvaluename + ', type: ' + element)
    return(myd)

HR_df = xmltodf(e,"HKQuantityTypeIdentifierHeartRate","HeartRate")
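The comment in the cell above guesses that a more efficient way exists. One option, sketched here as an assumption rather than as what this notebook does, is to stream the file with `xml.etree.ElementTree.iterparse` instead of loading the whole tree first. `iter_records` is a hypothetical helper, and `SAMPLE_XML` is a tiny made-up document in the shape of an Apple Health export.

```python
import io
import xml.etree.ElementTree as et
from datetime import datetime

# Made-up sample data, for illustration only
SAMPLE_XML = b"""<HealthData>
  <Record type="HKQuantityTypeIdentifierHeartRate" startDate="2019-03-01 08:00:00 -0500" value="62"/>
  <Record type="HKQuantityTypeIdentifierStepCount" startDate="2019-03-01 08:05:00 -0500" value="120"/>
  <Record type="HKQuantityTypeIdentifierHeartRate" startDate="2019-03-01 08:10:00 -0500" value="71.5"/>
</HealthData>"""


def iter_records(source, record_type):
    """Yield (start datetime, float value) pairs for one record type, streaming the XML."""
    for _event, elem in et.iterparse(source, events=("end",)):
        if elem.tag == "Record" and elem.get("type") == record_type:
            start = datetime.strptime(elem.get("startDate"), "%Y-%m-%d %H:%M:%S %z")
            yield start, float(elem.get("value"))
        # Drop the element's attributes and children once it has been handled
        elem.clear()


rows = list(iter_records(io.BytesIO(SAMPLE_XML), "HKQuantityTypeIdentifierHeartRate"))
```

The resulting pairs could be passed to `pd.DataFrame` in place of the two lists that `xmltodf` builds. Whether this is actually faster on a real export is something to measure, not assume.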
In [69]:
EX_df = xmltodf(e,"HKQuantityTypeIdentifierAppleExerciseTime","Extime")
EX_df.head()
Out[69]:
In [70]:
#comment this cell out if no legacy exports.
# extract legacy data, create series for heartrate to join with newer data
#HR_df_leg = xmltodf(legacy,"HKQuantityTypeIdentifierHeartRate","HeartRate")
#HR_df = pd.concat([HR_df_leg,HR_df])
In [71]:
#import pytz
#eastern = pytz.timezone('US/Eastern')
#st = datetime.strptime('2017-08-12 23:45:00 -0400', "%Y-%m-%d %H:%M:%S %z")
#ed = datetime.strptime('2017-08-13 00:15:00 -0400', "%Y-%m-%d %H:%M:%S %z")
#HR_df['c2'] = HR_df['Create'].dt.tz_convert('US/Eastern').dt.strftime("%Y-%m-%d")
In [72]:
#HR_df[(HR_df['Create'] >= st) & (HR_df['Create'] <= ed) ].head(10)
In [73]:
#reset plot - just for tinkering
plt.rcParams['figure.figsize'] = 30, 8
In [74]:
HR_df.boxplot(by='Month',column="HeartRate", return_type='axes')
plt.grid(axis='x')
plt.title('All Months')
plt.ylabel('Heart Rate')
plt.ylim(40,140)
Out[74]:
In [75]:
dx = HR_df[HR_df['Year']=='2019'].boxplot(by='Week',column="HeartRate", return_type='axes')
plt.title('All Weeks')
plt.ylabel('Heart Rate')
plt.xticks(rotation=90)
plt.grid(axis='x')
[plt.axvline(_x, linewidth=1, color='blue') for _x in [10,12]]
plt.ylim(40,140)
Out[75]:
In [76]:
monthval = '2019-03'
#monthval1 = '2017-09'
#monthval2 = '2017-10'
#HR_df[(HR_df['Month']==monthval1) | (HR_df['Month']== monthval2)].boxplot(by='Month-Day',column="HeartRate", return_type='axes')
HR_df[HR_df['Month']==monthval].boxplot(by='Month-Day',column="HeartRate", return_type='axes')
plt.grid(axis='x')
plt.rcParams['figure.figsize'] = 16, 8
plt.title('Daily for Month: '+ monthval)
plt.ylabel('Heart Rate')
plt.xticks(rotation=90)
plt.ylim(40,140)
Out[76]:
In [53]:
HR_df[HR_df['Month']==monthval].boxplot(by='Hour',column="HeartRate")
plt.title('Hourly for Month: '+ monthval)
plt.ylabel('Heart Rate')
plt.grid(axis='x')
plt.ylim(40,140)
Out[53]:
import calmap
ts = pd.Series(HR_df['HeartRate'].values, index=HR_df['Days'])
ts.index = pd.to_datetime(ts.index)
tstot = ts.groupby(ts.index).median()
plt.rcParams['figure.figsize'] = 16, 8
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
calmap.yearplot(data=tstot,year=2017)
The next two cells make it possible to introduce cycles that start on specific days and to include this information in the datasets so that the cycles can be overlaid in graphics. In the example below, each cycle is 21 days long. The getDelta function returns the cycle number when ttp == 0 and the days since day 0 of that cycle for any other value of ttp (the code below passes 1). Plotting against the days since day 0 is what allows the cycles to be overlaid.
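To make the two return values concrete, here is a small standalone sketch of the same idea. The start dates are made up, and `cycle_position` is a hypothetical helper: unlike getDelta, it returns both numbers at once and returns `None` instead of a sentinel when a day falls outside every cycle.

```python
from datetime import date

# Made-up cycle start dates, 21 days apart, for illustration only
cycle_starts = [date(2018, 1, 1), date(2018, 1, 22), date(2018, 2, 12)]


def cycle_position(day, starts, cycle_length=21):
    """Return (cycle number, days since that cycle's day 0), or None if outside every cycle."""
    candidates = [
        ((day - start).days, number)
        for number, start in enumerate(starts, start=1)
        if 0 <= (day - start).days < cycle_length
    ]
    if not candidates:
        return None
    # The smallest offset belongs to the most recent cycle start on or before the day
    offset, number = min(candidates)
    return number, offset


position = cycle_position(date(2018, 1, 25), cycle_starts)
```

Here 25 January is three days after the second start date, so it lands in cycle 2 at day 3. Every cycle shares the same 0 to 20 day axis, which is what lets the box plots be drawn on top of each other.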
In [21]:
# This isn't efficient yet, just a first swipe. It functions as intended.
def getDelta(res,ttp,cyclelength):
    # keep offsets that fall inside a cycle; mark everything else with the sentinel 999
    mz = [x if 0 <= x < cyclelength else 999 for x in res]
    if ttp == 0:
        return(mz.index(min(mz))+1)
    else:
        return(mz[mz.index(min(mz))])

#chemodays = np.array([date(2017,4,24),date(2017,5,16),date(2017,6,6),date(2017,8,14)])
chemodays = np.array([date(2018,1,26),date(2018,2,2),date(2018,2,9),date(2018,2,16),date(2018,2,26),date(2018,3,2),date(2018,3,19),date(2018,4,9),date(2018,5,1),date(2018,5,14),date(2018,6,18),date(2018,7,10),date(2018,8,6)])
HR_df = xmltodf(e,"HKQuantityTypeIdentifierHeartRate","HeartRate")
#I don't think this is efficient yet...
# for each timestamp, the number of days since each cycle start date
a = HR_df['Create'].apply(lambda ts: [delta.days for delta in ts.date()-chemodays])
HR_df['ChemoCycle'] = a.apply(lambda x: getDelta(x,0,21))
HR_df['ChemoDays'] = a.apply(lambda x: getDelta(x,1,21))
In [22]:
import seaborn as sns
plotx = HR_df[HR_df['ChemoDays']<=21]
plt.rcParams['figure.figsize'] = 24, 8
ax = sns.boxplot(x="ChemoDays", y="HeartRate", hue="ChemoCycle", data=plotx, palette="Set2",notch=1,whis=0,width=0.75,showfliers=False)
plt.ylim(65,130)
#as_index=False keeps ChemoDays as a regular column instead of the row index,
#and selecting HeartRate avoids taking the median of the non-numeric columns
plotx_med = plotx.groupby('ChemoDays', as_index=False)['HeartRate'].median()
snsplot = sns.pointplot(x='ChemoDays', y="HeartRate", data=plotx_med,color='Gray')
In [23]:
import seaborn as sns
sns.set(style="ticks", palette="muted", color_codes=True)
sns.boxplot(x="Month", y="HeartRate", data=HR_df,whis=np.inf, color="c")
# Add in points to show each observation
snsplot = sns.stripplot(x="Month", y="HeartRate", data=HR_df,jitter=True, size=1, alpha=.15, color=".3", linewidth=0)
In [24]:
hr_only = HR_df[['Create','HeartRate']]
hr_only.tail()
Out[24]:
In [25]:
hr_only.to_csv('~/Downloads/stc_hr.csv')