
Analysing Apple Health Data in Python – Part 1 : Extraction and Sleep Data

Since Apple launched its Health platform in 2014, I’ve collected huge amounts of data on my body. With a renewed focus on getting up earlier, I thought I would start by extracting the data I do have and exploring what secrets lie within.

In this post, I’m going to walk you through how to extract your data from Apple Health, parse it into a readable format, and then analyse it to find any trends or correlations. This will just be an introduction to

Step 1 : Extracting

There is no fully automated way to get your Apple Health data off your iPhone, but it’s not difficult either: Apple has made it easy to package and export all your data in one go.

You just need to open the “Health” app on your phone, tap the profile icon in the top right-hand corner, and select “Export All Health Data”.

Your phone will then package all your health data as a zipped XML file and let you send it to another app. In my case the file was over 1.6GB, so it was easier to AirDrop it to my Mac than to use email or another method.

Once you have dropped this across, extract the files. You may find you have a bunch of other data (for instance, I had workout routes as well). The main file we’re looking to analyse today is called “export.xml”.
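If you would rather do the unzip step from Python as well, something like the sketch below should work. The function name extract_export and the archive name export.zip are assumptions on my part, so adjust them to whatever your phone actually produced.

```python
import zipfile
from pathlib import Path


def extract_export(zip_path, dest_dir="."):
    """Unpack a Health export archive and return the path to export.xml."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    # export.xml usually sits inside a sub-folder, so search for it
    matches = sorted(dest.rglob("export.xml"))
    if not matches:
        raise FileNotFoundError(f"no export.xml found in {zip_path}")
    return matches[0]


# Example (assumed file name): xml_path = extract_export("export.zip")
```

The function returns the location of export.xml, so the result can be passed straight to whatever parses the file next.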

Step 2 : Convert

As mentioned above, my file was quite large. I could, if I had the time, import this file every time I needed to analyse it, but at this size it takes around 10-15 minutes to parse and load the dataframe into memory. Instead, we’re going to create a short ETL process (“Extract, Transform, Load”) to extract the information we need out of the XML file, change the structure to match our requirements, then load it into a more efficient binary file format for processing.

To do this you’re going to need the pandas and pyarrow packages installed. Credit goes to Alejandro Rodríguez for providing most of these steps (you can read their more in-depth overview of Apple Health data analysis here). If you want more detail on the below I suggest you head over to their post, but essentially this extracts the data, normalises the dates, and removes some of the “Apple-y” string naming.

import xml.etree.ElementTree as ET
import pandas as pd

# create element tree object 
tree = ET.parse('apple_health_export/export.xml') 

# for every health record, extract the attributes into a dictionary (columns). Then create a list (rows).
root = tree.getroot()
record_list = [x.attrib for x in root.iter('Record')]

# create DataFrame from a list (rows) of dictionaries (columns)
data = pd.DataFrame(record_list)

# proper type to dates
for col in ['creationDate', 'startDate', 'endDate']:
    data[col] = pd.to_datetime(data[col])

# value is numeric, NaN if fails
data['value'] = pd.to_numeric(data['value'], errors='coerce')

# some records do not measure anything, just count occurrences
# filling with 1.0 (= one time) makes it easier to aggregate
data['value'] = data['value'].fillna(1.0)

# shorter observation names: use vectorized replace function
data['type'] = data['type'].str.replace('HKQuantityTypeIdentifier', '', regex=False)
data['type'] = data['type'].str.replace('HKCategoryTypeIdentifier', '', regex=False)

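# save into feather as this is a more efficient data format
# NOTE: the original output file name is not shown; 'health_data.feather' is a placeholder
data.to_feather('converted_data/health_data.feather')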
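One thing worth knowing about the script above: ET.parse builds the whole tree in memory before any record is read, which is part of why a 1.6GB export is slow and memory-hungry. If that becomes a problem, one alternative (not what the script above does) is to stream the file with the standard library's iterparse. A rough sketch, where iter_records and wanted_type are names I've made up for illustration:

```python
import xml.etree.ElementTree as ET

# Prefixes stripped from the 'type' attribute, as in the script above
PREFIXES = ("HKQuantityTypeIdentifier", "HKCategoryTypeIdentifier")


def iter_records(source, wanted_type=None):
    """Yield one attribute dict per <Record> without keeping the whole tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != "end" or elem.tag != "Record":
            continue
        row = dict(elem.attrib)
        for prefix in PREFIXES:
            if row.get("type", "").startswith(prefix):
                row["type"] = row["type"][len(prefix):]
        if wanted_type is None or row.get("type") == wanted_type:
            yield row
        root.clear()  # drop elements already processed to free memory
```

The rows it yields are plain dictionaries, so list(iter_records('apple_health_export/export.xml', wanted_type='SleepAnalysis')) could be handed to pd.DataFrame in place of record_list, keeping only the sleep records.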

Step 3 : Group and Aggregate

For this tutorial, and for my own goals, I’m currently focusing on my sleep data. In the next few steps we’re going to pull out the sleep records, specifically those from the Apple Watch (as these show how many times I’ve woken up in the night), and produce some derived variables to analyse.

The first step is to re-load the data from the last script into a new dataframe:

import pandas as pd
import datetime

data = pd.read_feather('converted_data/')

From here we want to filter down to just the sleep data, and specifically the Apple Watch sleep data. Thankfully, Apple provides two columns we can filter on: “type”, which we want to set to “SleepAnalysis”, and “sourceName”, which we want to set to “Jon’s Apple Watch” (substitute the name of your own watch).

# filter on sleep data and apple watch info ONLY
# (.copy() so that columns added later don't trigger a SettingWithCopyWarning)
sleep_data = data[data['type'] == "SleepAnalysis"]
sleep_data = sleep_data[sleep_data['sourceName'] == "Jon’s Apple Watch"].copy()

This gives us a dataframe that looks something like the following:

         type           sourceName         sourceVersion  creationDate               startDate                  endDate                    value
4014637  SleepAnalysis  Jon’s Apple Watch  7              2020-09-18 05:48:57+01:00  2020-09-17 22:33:31+01:00  2020-09-17 23:57:01+01:00  1
4014638  SleepAnalysis  Jon’s Apple Watch  7              2020-09-18 05:48:57+01:00  2020-09-18 00:00:31+01:00  2020-09-18 00:45:01+01:00  1
4014639  SleepAnalysis  Jon’s Apple Watch  7              2020-09-18 05:48:57+01:00  2020-09-18 00:47:01+01:00  2020-09-18 01:59:31+01:00  1
4014640  SleepAnalysis  Jon’s Apple Watch  7              2020-09-18 05:48:57+01:00  2020-09-18 02:01:31+01:00  2020-09-18 02:29:01+01:00  1
4014641  SleepAnalysis  Jon’s Apple Watch  7              2020-09-18 05:48:57+01:00  2020-09-18 02:32:01+01:00  2020-09-18 05:47:31+01:00  1

Nothing too useful yet, but a few items to note:

  • startDate – provides the start of a sleep cycle
  • endDate – provides, you’ve guessed it, the end of a sleep cycle
  • creationDate – provides when this data was pushed from the Apple Watch to the Apple Health app. Usefully, this happens in a batch in the morning, making this a great field to group data by.

Although this may only be 2 fields, we can still derive a lot from them. I’m looking to analyse sleep patterns such as average wake time, restlessness and so on. To do this, we’re going to create the following fields:

  • Time Asleep – How long am I asleep for in each record (endDate minus startDate)
  • Total Time Asleep – How long am I asleep for in total each night (sum of time asleep for the night)
  • Bed Time – When do I actually hit the hay and get to bed? (minimum startDate for the night)
  • Awake Time – When does my watch detect when I am awake (maximum endDate for the night)
  • Sleep Counts – How many times am I restless during the night (count of records for the night)
  • REM cycles – Roughly how many REM (rapid eye movement) cycles have I got (each record’s length floor-divided by 90 minutes, then summed for the night – this is the most complex of the aggregations)
  • Total Time in Bed – How long did I spend in bed (awake time minus bed time)
  • Restless time – How restless was I during the night (total time in bed minus total time asleep )

8 fields from 2 data points, that’s pretty cool! Let’s get into the code and how I’ve calculated each one.

# calculate time between date(s)
sleep_data['time_asleep'] = sleep_data['endDate'] - sleep_data['startDate']

# records are grouped by creation date, so let's use that to sum up the values we need here
# total time asleep as a sum of the asleep time
# awake and bed times are max's and min's
# sleep count is the number of records for the night (how often the Apple Watch detected movement)
# rem is the number of sleep cycles over 90 minutes (divided by 90 if they were longer than 1 cycle)
sleep_data = sleep_data.groupby('creationDate').agg(total_time_asleep=('time_asleep', 'sum'),
    bed_time=('startDate', 'min'), 
    awake_time=('endDate', 'max'), 
    sleep_count=('time_asleep', 'count'),
    rem_cycles=pd.NamedAgg(column='time_asleep', aggfunc=lambda x: (x // datetime.timedelta(minutes=90)).sum()))

# Time in Bed will be different to Apple's reported figure - 
# as Apple uses the time you place your iPhone down as an additional 
# datapoint, which of course, is incorrect if you try to maintain 
# some device separation in the evenings.
# For now - we will just use Apple Watch data here
sleep_data['time_in_bed'] = sleep_data['awake_time'] - sleep_data['bed_time']
sleep_data['restless_time'] = sleep_data['time_in_bed'] - sleep_data['total_time_asleep']
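To sanity-check these formulas, here is the same arithmetic in plain Python for the five sample records shown earlier (the night of 17 September 2020). This is only an illustrative sketch: summarise_night is a name I’ve made up, and it handles a single night rather than grouping by creationDate.

```python
from datetime import datetime, timedelta

# (startDate, endDate) pairs from the sample rows above, all from one night
night = [
    ("2020-09-17 22:33:31+01:00", "2020-09-17 23:57:01+01:00"),
    ("2020-09-18 00:00:31+01:00", "2020-09-18 00:45:01+01:00"),
    ("2020-09-18 00:47:01+01:00", "2020-09-18 01:59:31+01:00"),
    ("2020-09-18 02:01:31+01:00", "2020-09-18 02:29:01+01:00"),
    ("2020-09-18 02:32:01+01:00", "2020-09-18 05:47:31+01:00"),
]


def summarise_night(records, cycle=timedelta(minutes=90)):
    """Derive the per-night fields from a list of (start, end) strings."""
    spans = [(datetime.fromisoformat(s), datetime.fromisoformat(e))
             for s, e in records]
    asleep = [end - start for start, end in spans]
    bed_time = min(start for start, _ in spans)
    awake_time = max(end for _, end in spans)
    total_asleep = sum(asleep, timedelta())
    time_in_bed = awake_time - bed_time
    return {
        "total_time_asleep": total_asleep,
        "bed_time": bed_time,
        "awake_time": awake_time,
        "sleep_count": len(spans),
        # whole 90-minute blocks inside each record, summed over the night
        "rem_cycles": sum(d // cycle for d in asleep),
        "time_in_bed": time_in_bed,
        "restless_time": time_in_bed - total_asleep,
    }


summary = summarise_night(night)
print(summary["total_time_asleep"])                   # 7:03:30
print(summary["time_in_bed"])                         # 7:14:00
print(summary["restless_time"])                       # 0:10:30
print(summary["sleep_count"], summary["rem_cycles"])  # 5 2
```

So for that night the watch logged just over 7 hours asleep out of 7 hours 14 minutes in bed, and both of the counted 90-minute blocks come from the final stretch of a little over three hours.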

Step 4 : Analyse

Now that we’ve pivoted and aggregated our data, the next step is to identify any trends. I’m going to do some very basic initial analysis here and leave the deeper dive to a later post, but for now, let’s look for any overall trends in the data using Matplotlib.

import matplotlib.pyplot as plt

# convert time duration to minutes for easier plotting and comparison
sleep_data['time_in_bed'] = (sleep_data['time_in_bed'].dt.total_seconds()/60)
sleep_data['total_time_asleep'] = (sleep_data['total_time_asleep'].dt.total_seconds()/60)

# plot both series against the creationDate index
chart1 = sleep_data[['time_in_bed','total_time_asleep']].plot(use_index=True)
chart1.set_ylabel('minutes')
plt.show()

As we can see from this initial chart, my sleep time is very cyclical across the week, and I had a major crash around last September (this was when I had some big issues with my teeth, requiring a root canal). In later posts I’ll be diving into these trends a little deeper, using AI to identify any trends throughout the months, weeks and years.

If you’ve found today’s analysis useful, feel free to drop me a comment on Twitter @busbyjon, and follow along for future posts on this subject and others in this series.
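A simple first check on the weekly pattern, before reaching for anything fancier, is to average sleep by day of the week. The sketch below uses plain Python and made-up numbers purely for illustration; mean_by_weekday and the sample values are not from my data.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def mean_by_weekday(minutes_by_date):
    """Average minutes asleep for each weekday that appears in the input."""
    buckets = defaultdict(list)
    for day, minutes in minutes_by_date.items():
        buckets[DAY_NAMES[day.weekday()]].append(minutes)
    return {name: mean(buckets[name]) for name in DAY_NAMES if name in buckets}


# invented values, for illustration only
sample = {
    date(2020, 9, 14): 400,  # Monday
    date(2020, 9, 19): 500,  # Saturday
    date(2020, 9, 21): 420,  # Monday
}
print(mean_by_weekday(sample))  # {'Monday': 410, 'Saturday': 500}
```

If the weekend averages sit well above the weekday ones, that would back up what the chart suggests by eye.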
