Guide to Extracting Data w/ APIs from the Withings ScanWatch§

900786816616432c8bbc3702974b567c

Ever wanted a smart watch with an EKG? Introducing the Withings ScanWatch, a smart watch that can not only detect your heart rate optically, but also with an EKG! While this notebook is meant for the ScanWatch, it can be easily adapted to any other Withings product with some modifications.

Also note that you need to set up your Withings ScanWatch prior to running this notebook.

If you want to know more about Withings Body+ Scale, see the README for a detailed analysis of performances, sensors, data privacy, and extraction pipelines.

We will be able to extract the following parameters (see the definitions at this documenation page):

Click here to view all extractable parameters

Parameter Name	Sampling Frequency
Heart Rate	Every 10 minutes OR every second (when set to continuous heart rate mode)
# of REM sleep phases	Per sleep
Sleep Efficiency	Per sleep
Sleep Latency	Per sleep
Total Sleep Time	Per sleep
Total Time in bed	Per sleep
Wakeup latency	Per sleep
Waso	Per sleep
Asleepduration	Per sleep
Deep sleep duration	Per sleep
Duration to sleep	Per sleep
Duration to wakeup	Per sleep
Average heart rate	Per sleep
Max heart rate	Per sleep
Min heart rate	Per sleep
Light sleep duration	Per sleep
Night events	Per sleep
Out of bed count	Per sleep
REM sleep duration	Per sleep
Average resp rate	Per sleep
Minimal resp. rate	Per sleep
Max resp. rate	Per sleep
Sleep score	Per sleep
Total snoring time	Per sleep
Snoring episode count	Per sleep
Wakeup count	Per sleep
Wakeup duration	Per sleep

Note that Withings provides even more measurements than just these. You can check these out at the API reference. Since we focus on heart rate and sleep here, though, those are the main measurement types we extract.

In this guide, we sequentially cover the following five topics to extract from the Withings API: 1. Setup - We cover setting up the watch and library imports in this section. 2. Authentication/Authorization - This requires a couple extra steps on your part 3. Data extraction - You can get data from the API in a couple lines of code. 4. Data visualization - 4.1: We reproduce a plot for heart rate over the course of a day - 4.2: We reproduce a plot for sleep data over the course of a week - 4.3: We visualize the times / days when the user fails to wear the watch. 5. Data analysis - 5.1: We try to find a correlation between the length of a sleep period and the median heart rate for that sleep period. We find that the correlation is not statistically significant. - 5.2: We demonstrate a simple outlier detection algorithm running on the dataset.

Disclaimer: this notebook is purely for educational purposes. All of the data currently stored in this notebook is purely synthetic, meaning randomly generated according to rules we created. Despite this, the end-to-end data extraction pipeline has been tested on our own data, meaning that if you enter your own credentials on your own Colab instance, you can visualize your own real data. That being said, we were unable to thoroughly test the timezone functionality, though, since we only have one account, so beware.

1. Setup§

1.1 Study participant setup and usage§

To set up the watch itself so that you can run this notebook, download the Health Mate app and follow the instructions on the app. The app will pair via bluetooth to the watch. In our experience, this process was fairly straightforward.

2. Authentication/Authorization§

To be able to make requests to the API, the easiest way is to use the public developer API. This section roughly follows the steps outlined here on their website.

First, follow the non-colab steps listed below:

Visit the developer portal and click “Open Developer Dashboard” on the top right.
Once logged in, click “Add an app”.
For now, you can just click “I don’t know” under “Services”, accept terms of use, and click “Next”.
Put whatever you want under “Application Name” (we used withings-test), anything under “Application Description”, and “https://wbsapi.withings.net/v2/oauth2” under Registered URLs, then click “Done”.
- NOTE: “registered URLs” is intended to be a URL to a webserver you control and can receive requests from. However, in this notebook we are simply using it as a placeholder, as this functionality is not strictly necessary for obtaining your data.

In the end, you should see something like the below.

615ef4aba8d543b5932fa75c7ba5e47d

Now we can proceed with the rest of the notebook.

To be able to make requests to the API and extract the data we need, we need to first issue an access token. This (ephemeral) access token will serve as our key to the data. While, you don’t necessarily need to be familiar with how the issuing of the authtoken occurs, you can learn more about it by visiting the official Withings tutorial.

[ ]:

#@title 6. Enter your credentials below (from the application you just created)

# we hide a bunch of the library imports in here :)
!pip install -q july

import requests
import urllib
import json
from datetime import datetime
from tqdm import tqdm

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.ndimage import gaussian_filter
from scipy import stats
import july
from july.utils import date_range

CLIENT_ID = "" #@param {type:"string"}
CUSTOMER_SECRET = "" #@param {type:"string"}

STATE = 'string'
ACCOUNT_URL = 'https://account.withings.com'
CALLBACK_URI = 'https://wbsapi.withings.net/v2/oauth2'


payload = {'response_type': 'code',  # imposed string by the api
    'client_id': CLIENT_ID,
    'state': STATE,
    'scope': 'user.info,user.metrics,user.activity',  # see docs (https://developer.withings.com/api-reference/#operation/oauth2-authorize) for enhanced scope
    'redirect_uri': CALLBACK_URI,  # URL of this app
    #'mode': 'demo'  # Use demo mode, DELETE THIS FOR REAL APP
}

url = f'{ACCOUNT_URL}/oauth2_user/authorize2?'

for key, value in payload.items():
    url += f'{key}={value}&'

url = url[:-1]

print(url)

https://account.withings.com/oauth2_user/authorize2?response_type=code&client_id=&state=string&scope=user.info,user.metrics,user.activity&redirect_uri=https://wbsapi.withings.net/v2/oauth2

Now visit the above URL and click “Allow this app”, and copy the URL you were redirected to into the text field below. Note that if you mess up once, you have to go through the above URL again (including clicking “Allow this app”). Also, the URL is only valid for 30 seconds, so be quick!

[ ]:

#@title 7. Copy and paste the URL you were redirected to below
redirect_url = "https://wbsapi.withings.net/v2/oauth2?code=09f095bcd678c7beaa9285e189c5a3917329ed7d&state=string" #@param {type:"string"}

try:
    code = urllib.parse.parse_qs(urllib.parse.urlparse(redirect_url).query)['code'][0]
except Exception as e:
    print(f'Caught error:\n{e}\n')
    print("Please copy and paste the entire URL (including https)")

params = {
    'action': 'requesttoken',
    'grant_type': 'authorization_code',
    'client_id': CLIENT_ID,
    'client_secret': CUSTOMER_SECRET,
    'code': code,
    #'scope': 'user.info',
    'redirect_uri': 'https://wbsapi.withings.net/v2/oauth2'
}

out = requests.get('https://wbsapi.withings.net/v2/oauth2', data=params)

out = json.loads(out.text)

try:
    access_token = out['body']['access_token']
except KeyError as e:
    print('Took too long to paste in redirect URL. Please repeat step 7.')

Took too long to paste in redirect URL. Please repeat step 7.

Now that we have our access token, we can begin making requests to the API! This access token will last only three hours, though, so you would need to re-do step 7 if three hours pass.

3. Data extraction§

Here, data extraction is pretty simple! We’ve made it possible to get all heart rate and sleep data in one function call each.

If you need to customize this part further and need to dig into the code, notice that all we need to do is make a few GET requests with the right query parameters. See the overall health data API page or the “Measure” endpoints specifically for more info.

[ ]:

#@title Enter start and end dates
start_date = "2020-05-20" #@param {type:"date"}
end_date = "2022-07-20" #@param {type:"date"}

import requests
import urllib
import json
from datetime import datetime
from tqdm import tqdm

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.ndimage import gaussian_filter
from scipy import stats
import july
from july.utils import date_range

from datetime import timedelta, datetime
import pandas as pd
import numpy as np

def create_synthetic_sleeps_df():

    syn_sleeps = pd.DataFrame()

    syn_sleeps['id'] = np.random.randint(0, 100000000, size=(20,))
    syn_sleeps['timezone'] = 'America/Los_Angeles'
    syn_sleeps['model'] = 16
    syn_sleeps['model_id'] = 93
    syn_sleeps['hash_deviceid'] = 'd41d8cd98f00b204e9800998ecf8427e'  # randomly generated
    syn_sleeps['date'] = [datetime.fromtimestamp(datetime.strptime('2022-05-17', '%Y-%m-%d').timestamp() + i * 24 * 3600).strftime('%Y-%m-%d') for i in range(20)]


    startdates = []
    enddates = []

    for date in syn_sleeps['date']:
        sleep_start = np.random.randint(20, 27)
        sleep_time = np.random.randint(4, 9)

        startdate = datetime.strptime(date, '%Y-%m-%d') + timedelta(hours=sleep_start + 7)

        enddate = startdate + timedelta(hours=sleep_time)

        startdate, enddate = int(startdate.timestamp()), int(enddate.timestamp())

        startdates.append(startdate)
        enddates.append(enddate)

    all_data = []

    for i in range(20):

        data = {
            'wakeupduration': np.random.randint(0, 3000),
            'wakeupcount': np.random.poisson(1),
            'durationtosleep': np.random.randint(120, 180),
            'durationtowakeup': np.random.randint(0, 700),
            'total_timeinbed': np.random.randint(10000, 50000),
            'total_sleep_time': np.random.randint(10000, 50000),
            'sleep_efficiency': np.random.rand() * .1 + .9,
            'sleep_latency': np.random.randint(120, 130),
            'wakeup_latency': np.random.randint(0, 800),
            'waso': np.random.randint(0, 4000),
            'nb_rem_episodes': 0,
            'out_of_bed_count': 0,
            'lightsleepduration': np.random.randint(6000, 35000),
            'deepsleepduration': np.random.randint(3000, 17000),
            'hr_average': np.random.randint(55, 65),
            'hr_min': np.random.randint(40, 60),
            'hr_max': np.random.randint(70, 120),
            'sleep_score': np.random.randint(30, 80)

        }

        all_data.append(data)

    syn_sleeps['startdate'] = startdates
    syn_sleeps['enddate'] = enddates

    syn_sleeps['data'] = all_data

    syn_sleeps['created'] = enddates
    syn_sleeps['modified'] = enddates

    return syn_sleeps

from tqdm.notebook import tqdm
import hashlib

def create_syn_hr(syn_sleeps):
    start_day = datetime.strptime('2022-05-20', '%Y-%m-%d')

    hour_usage = [0.8] * 3 + [0.9] * 7 + [1.0] * 10 + [0.9] * 4

    datetimes = []

    for day_offset in tqdm(range(20)):
        for hour_offset in range(24):
            for minute_offset in range(0, 60, 10):
                day = start_day + timedelta(days=day_offset)
                hour = day + timedelta(hours=hour_offset)
                minute = hour + timedelta(minutes=minute_offset)

                if np.random.uniform(0, 1) < hour_usage[hour_offset]:
                    datetimes.append(minute)

    hr_measurements = (np.random.randn(len(datetimes)) * 5 + 90).astype('int')

    timestamps = np.array([dt.timestamp() for dt in datetimes])

    for i, (startdate, enddate) in tqdm(enumerate(zip(syn_sleeps.startdate, syn_sleeps.enddate))):
        idxes = np.where(np.logical_and(timestamps > startdate, timestamps < enddate))[0]
        duration = (enddate - startdate) / 3600
        avg_hr = -5/7 * duration + 64.1428571429

        hr_measurements[idxes] = (np.random.randn(idxes.shape[0]) + avg_hr).astype('int')

    syn_hr = pd.DataFrame()
    syn_hr['datetime'] = datetimes
    syn_hr['heart_rate'] = hr_measurements

    syn_hr['model'] = 'ScanWatch'
    syn_hr['model_id'] = 93
    syn_hr['deviceid'] = hashlib.md5().hexdigest()

    num_garbage = 1000

    timestamps_garbage = np.random.uniform((start_day - timedelta(days=100)).timestamp(), start_day.timestamp(), size=num_garbage)

    garbage_df = pd.DataFrame()
    garbage_df['datetime'] = [datetime.fromtimestamp(int(ts)) for ts in timestamps_garbage]
    garbage_df['heart_rate'] = (np.random.randn(num_garbage) * 5 + 90).astype('int')

    garbage_df['model'] = None
    garbage_df['model_id'] = 1059
    garbage_df['deviceid'] = None


    syn_hr = pd.concat((garbage_df, syn_hr), ignore_index=True)

    return syn_hr

#########################################
# real code below, synthetic code above #
#########################################

num_to_description = {1: 'Weight (kg)',
                      4: 'Height (meter)',
                      5: 'Fat Free Mass (kg)',
                      6: 'Fat Ratio (%)',
                      8: 'Fat Mass Weight (kg)',
                      9: 'Diastolic Blood Pressure (mmHg)',
                      10: 'Systolic Blood Pressure (mmHg)',
                      11: 'Heart Pulse (bpm) - only for BPM and scale devices',
                      12: 'Temperature (celsius)',
                      54: 'SP02 (%)',
                      71: 'Body Temperature (celsius)',
                      73: 'Skin Temperature (celsius)',
                      76: 'Muscle Mass (kg)',
                      77: 'Hydration (kg)',
                      88: 'Bone Mass (kg)',
                      91: 'Pulse Wave Velocity (m/s)',
                      123: 'VO2 max is a numerical measurement of your body’s ability to consume oxygen (ml/min/kg).',
                      135: 'QRS interval duration based on ECG signal',
                      136: 'PR interval duration based on ECG signal',
                      137: 'QT interval duration based on ECG signal',
                      138: 'Corrected QT interval duration based on ECG signal',
                      139: 'Atrial fibrillation result from PPG'}

NUM_RETRIES = 3

def fetch_all_wrapper(endpoint_url, data, headers, arr_key, parse_data=lambda x: x):
    # wrapper around public API that retrieves arbitrarily large # of
    # records, since there is a restriction of # of records per API response
    # NOTES:
    # out['body'][arr_key] is concatenated across several requests
    # parse_data is a function that parses the returned array

    cur_offset = 0
    arr_complete = None

    arr = None

    while True:
        # endpoint can be flaky if the response payload is extremely large,
        # so retry at most NUM_RETRIES times
        for i in range(NUM_RETRIES):
            data_args = {
                **data,
                'offset': cur_offset
            }

            out = requests.post(endpoint_url, data=data_args, headers=headers)

            out = json.loads(out.text)

            if out['status'] == 401:
                raise Exception(f'request response is {out} for request {data_args} to endpoint {endpoint_url}, headers {headers}')

            try:
                arr = parse_data(out['body'][arr_key])
                break
            except KeyError:
                if 'body' in out.keys():
                    raise Exception(f'got key {arr_key}, expected one of {out["body"].keys()}')
                elif out['status'] == 2555:
                    # when the payload is too large, this is the status code
                    continue
                else:
                    raise Exception(f'request response is {out} for request {data_args} to endpoint {endpoint_url}, headers {headers}')

        # for example, https://developer.withings.com/api-reference/#operation/measurev2-getactivity
        # vs. https://developer.withings.com/api-reference/#operation/measure-getmeas

        if arr is not None:
            if type(arr) == type({}):
                if arr_complete is None:
                    arr_complete = dict()

                arr_complete.update(arr)

            elif type(arr) == type([]):
                if arr_complete is None:
                    arr_complete = []

                arr_complete += arr

            # continue if there's still more to get
            if 'more' in out['body'].keys() and out['body']['more'] == 1:
                cur_offset = out['body']['offset']
            else:
                break
        else:
            break

    # replace with concatenated version
    if arr_complete is not None:
        out['body'][arr_key] = arr_complete

    return out

def fetch_all_heart_rate(start='2020-03-10', end='2022-05-28'):
    # get all dates heart rate was collected for
    out = fetch_all_wrapper('https://wbsapi.withings.net/v2/measure', {
        'action': 'getactivity',
        'startdateymd': start,
        'enddateymd': end,
        'data_fields': 'hr_average'
    }, {'Authorization': f'Bearer {access_token}'},
        arr_key='activities')

    dates = [act['date'] for act in out['body']['activities']]

    # now for each date get the heart rate data and store as list of dicts
    dict_list = []
    for date in tqdm(dates):
        out = fetch_all_wrapper('https://wbsapi.withings.net/v2/measure', {
            'action': 'getintradayactivity',
            'startdate': int(datetime.strptime(date, '%Y-%m-%d').timestamp()),
            'enddate': int(datetime.strptime(date, '%Y-%m-%d').timestamp()) + 24 * 3600,
            'data_fields': 'heart_rate'
        }, {'Authorization': f'Bearer {access_token}'},
            arr_key='series')


        if 'body' in out.keys():
            dict_list += [{'datetime': datetime.fromtimestamp(int(k)), **v} for k,v in out['body']['series'].items()]

    df = pd.DataFrame.from_dict(dict_list)

    return df


def fetch_all_sleeps(start='2020-03-10', end='2022-05-28'):
    out = fetch_all_wrapper('https://wbsapi.withings.net/v2/sleep', {
        'action': 'getsummary',
        'startdateymd': '2020-07-01',
        'enddateymd': '2022-07-01',
        'data_fields': 'nb_rem_episodes,sleep_efficiency,sleep_latency,total_sleep_time,total_timeinbed,wakeup_latency,waso,asleepduration,deepsleepduration,durationtosleep,durationtowakeup,hr_average,hr_max,hr_min,lightsleepduration,night_events,out_of_bed_count,remsleepduration,rr_average,rr_max,rr_min,sleep_score,snoring,snoringepisodecount,wakeupcount,wakeupduration'
    }, {'Authorization': f'Bearer {access_token}'}, arr_key='series')

    df = pd.DataFrame.from_dict(out['body']['series'])

    return df

SEED = 3

np.random.seed(SEED)

[ ]:

synthetic = True

if synthetic:
    sleeps_df = create_synthetic_sleeps_df()
    hr_df = create_syn_hr(sleeps_df)
else:
    hr_df = fetch_all_heart_rate(start=start_date, end=end_date)
    sleeps_df = fetch_all_sleeps(start=start_date, end=end_date)

[ ]:

hr_df

	datetime	heart_rate	model	model_id	deviceid
0	2022-04-09 10:14:45	88	None	1059	None
1	2022-03-29 12:15:31	86	None	1059	None
2	2022-03-24 01:02:15	89	None	1059	None
3	2022-03-01 23:47:02	88	None	1059	None
4	2022-04-30 03:46:42	87	None	1059	None
...	...	...	...	...	...
3672	2022-06-08 22:30:00	87	ScanWatch	93	d41d8cd98f00b204e9800998ecf8427e
3673	2022-06-08 22:40:00	89	ScanWatch	93	d41d8cd98f00b204e9800998ecf8427e
3674	2022-06-08 23:00:00	96	ScanWatch	93	d41d8cd98f00b204e9800998ecf8427e
3675	2022-06-08 23:10:00	90	ScanWatch	93	d41d8cd98f00b204e9800998ecf8427e
3676	2022-06-08 23:30:00	89	ScanWatch	93	d41d8cd98f00b204e9800998ecf8427e

3677 rows × 5 columns

[ ]:

sleeps_df

	id	timezone	model	model_id	hash_deviceid	date	startdate	enddate	data	created	modified
0	83957610	America/Los_Angeles	16	93	d41d8cd98f00b204e9800998ecf8427e	2022-05-17	1652850000	1652871600	{'wakeupduration': 2779, 'wakeupcount': 2, 'du...	1652871600	1652871600
1	35325592	America/Los_Angeles	16	93	d41d8cd98f00b204e9800998ecf8427e	2022-05-18	1652943600	1652961600	{'wakeupduration': 858, 'wakeupcount': 1, 'dur...	1652961600	1652961600
2	88681721	America/Los_Angeles	16	93	d41d8cd98f00b204e9800998ecf8427e	2022-05-19	1653026400	1653048000	{'wakeupduration': 1519, 'wakeupcount': 0, 'du...	1653048000	1653048000
3	41466808	America/Los_Angeles	16	93	d41d8cd98f00b204e9800998ecf8427e	2022-05-20	1653120000	1653134400	{'wakeupduration': 1202, 'wakeupcount': 0, 'du...	1653134400	1653134400
4	46504192	America/Los_Angeles	16	93	d41d8cd98f00b204e9800998ecf8427e	2022-05-21	1653210000	1653228000	{'wakeupduration': 939, 'wakeupcount': 1, 'dur...	1653228000	1653228000
5	29254421	America/Los_Angeles	16	93	d41d8cd98f00b204e9800998ecf8427e	2022-05-22	1653292800	1653314400	{'wakeupduration': 489, 'wakeupcount': 0, 'dur...	1653314400	1653314400
6	77081597	America/Los_Angeles	16	93	d41d8cd98f00b204e9800998ecf8427e	2022-05-23	1653361200	1653386400	{'wakeupduration': 2876, 'wakeupcount': 0, 'du...	1653386400	1653386400
7	91453130	America/Los_Angeles	16	93	d41d8cd98f00b204e9800998ecf8427e	2022-05-24	1653469200	1653483600	{'wakeupduration': 1023, 'wakeupcount': 0, 'du...	1653483600	1653483600
8	80522091	America/Los_Angeles	16	93	d41d8cd98f00b204e9800998ecf8427e	2022-05-25	1653544800	1653562800	{'wakeupduration': 2442, 'wakeupcount': 0, 'du...	1653562800	1653562800
9	2513913	America/Los_Angeles	16	93	d41d8cd98f00b204e9800998ecf8427e	2022-05-26	1653631200	1653645600	{'wakeupduration': 865, 'wakeupcount': 1, 'dur...	1653645600	1653645600
10	40289961	America/Los_Angeles	16	93	d41d8cd98f00b204e9800998ecf8427e	2022-05-27	1653724800	1653753600	{'wakeupduration': 2142, 'wakeupcount': 1, 'du...	1653753600	1653753600
11	84795018	America/Los_Angeles	16	93	d41d8cd98f00b204e9800998ecf8427e	2022-05-28	1653814800	1653832800	{'wakeupduration': 2484, 'wakeupcount': 0, 'du...	1653832800	1653832800
12	86832247	America/Los_Angeles	16	93	d41d8cd98f00b204e9800998ecf8427e	2022-05-29	1653897600	1653926400	{'wakeupduration': 2682, 'wakeupcount': 1, 'du...	1653926400	1653926400
13	14215648	America/Los_Angeles	16	93	d41d8cd98f00b204e9800998ecf8427e	2022-05-30	1653980400	1653994800	{'wakeupduration': 2796, 'wakeupcount': 1, 'du...	1653994800	1653994800
14	31454868	America/Los_Angeles	16	93	d41d8cd98f00b204e9800998ecf8427e	2022-05-31	1654052400	1654070400	{'wakeupduration': 2892, 'wakeupcount': 0, 'du...	1654070400	1654070400
15	87489117	America/Los_Angeles	16	93	d41d8cd98f00b204e9800998ecf8427e	2022-06-01	1654146000	1654174800	{'wakeupduration': 1298, 'wakeupcount': 0, 'du...	1654174800	1654174800
16	83035559	America/Los_Angeles	16	93	d41d8cd98f00b204e9800998ecf8427e	2022-06-02	1654243200	1654257600	{'wakeupduration': 110, 'wakeupcount': 0, 'dur...	1654257600	1654257600
17	67312014	America/Los_Angeles	16	93	d41d8cd98f00b204e9800998ecf8427e	2022-06-03	1654311600	1654329600	{'wakeupduration': 2373, 'wakeupcount': 0, 'du...	1654329600	1654329600
18	43615706	America/Los_Angeles	16	93	d41d8cd98f00b204e9800998ecf8427e	2022-06-04	1654401600	1654419600	{'wakeupduration': 2688, 'wakeupcount': 2, 'du...	1654419600	1654419600
19	85920406	America/Los_Angeles	16	93	d41d8cd98f00b204e9800998ecf8427e	2022-06-05	1654502400	1654524000	{'wakeupduration': 2813, 'wakeupcount': 1, 'du...	1654524000	1654524000

Now that we’ve got dataframes with data, we can see the data they have. In particular, we have a ton of measurements from a non-ScanWatch device, so we’ll get rid of those. Note that it is theoretically possible to have another ScanWatch device (and maybe the model name will be ScanWatch2, or something different), but we were unable to test this, so if you switch between devices you might want to be careful here.

[ ]:

hr_df = hr_df.drop(np.where(hr_df.model != 'ScanWatch')[0])
hr_df

	datetime	heart_rate	model	model_id	deviceid
1000	2022-05-20 00:00:00	82	ScanWatch	93	d41d8cd98f00b204e9800998ecf8427e
1001	2022-05-20 00:10:00	84	ScanWatch	93	d41d8cd98f00b204e9800998ecf8427e
1002	2022-05-20 00:20:00	87	ScanWatch	93	d41d8cd98f00b204e9800998ecf8427e
1003	2022-05-20 00:40:00	92	ScanWatch	93	d41d8cd98f00b204e9800998ecf8427e
1004	2022-05-20 00:50:00	95	ScanWatch	93	d41d8cd98f00b204e9800998ecf8427e
...	...	...	...	...	...
3672	2022-06-08 22:30:00	87	ScanWatch	93	d41d8cd98f00b204e9800998ecf8427e
3673	2022-06-08 22:40:00	89	ScanWatch	93	d41d8cd98f00b204e9800998ecf8427e
3674	2022-06-08 23:00:00	96	ScanWatch	93	d41d8cd98f00b204e9800998ecf8427e
3675	2022-06-08 23:10:00	90	ScanWatch	93	d41d8cd98f00b204e9800998ecf8427e
3676	2022-06-08 23:30:00	89	ScanWatch	93	d41d8cd98f00b204e9800998ecf8427e

2677 rows × 5 columns

And now we have just 2.6k rows!

4. Data Visualization§

In this section, you’ll expect to see a couple plots from the Health Mate mobile app reproduced.

4.1: Heart rate single day§

First, we’ll reproduce the below plot you can find in the Health Mate mobile app that displays heart rate data over the course of a single day. This plot is super helpful because it allows you to view the heart rate signal at a very granular level, and see how it varies over the course of a few hours.

99d92ba6bf0f49ebb79c9ace7b70aaaf

Above is a plot taken from the official Withings mobile app.

[ ]:

#@title Enter date
date = "2022-05-25" #@param {type:"date"}
from matplotlib.ticker import FormatStrFormatter
import matplotlib.dates as dates
import matplotlib.transforms

from dateutil import tz
from scipy.interpolate import make_interp_spline
from datetime import timedelta

# measurements are taken every 10 minutes, displayed on app every 30
HEART_RATE_RECORDING_LENGTH = 30 * 60

with plt.style.context('dark_background'):
    # get the start and end times as timestamps by using the datetime library
    start_ts = datetime.strptime(date + ' 00:00:00-07:00', '%Y-%m-%d %H:%M:%S%z').timestamp()
    #end_ts = datetime.strptime(date + ' 23:59:59-07:00', '%Y-%m-%d %H:%M:%S%z').timestamp()
    end_ts = start_ts + 24 * 3600 + 30 * 60  # 00:30 the next day
    #end_ts = datetime.strptime('2022-05-25 00:30:00-07:00', '%Y-%m-%d %H:%M:%S%z').timestamp()

    # now find the indices in hr_df.timestamp that match as closely as possible
    start_idx = np.argmin(np.abs(hr_df.datetime.apply(lambda x: x.timestamp()) - start_ts))
    end_idx = np.argmin(np.abs(hr_df.datetime.apply(lambda x: x.timestamp()) - end_ts))

    x = hr_df.datetime.iloc[start_idx:end_idx]
    y = hr_df.heart_rate.iloc[start_idx:end_idx]

    # make it not as bumpy with gaussian filter. note that this does not reproduce
    # the curve exactly, as I'm not sure what smoothing algorithm they used...
    y = gaussian_filter(y, sigma=3)

    fig = plt.figure(figsize=(18,6), facecolor='black')

    x_timestamp = np.array([x_.timestamp() for x_ in x])

    # get the gaps. we include [6] as well because when you do np.diff,
    # it actually leaves out exactly one element
    differences = np.concatenate((np.diff(x_timestamp), [60 * 10]))

    # interpret a gap (i.e. when a user takes off the device for some prolonged
    # period of time) as any two measurements that are taken more than
    # 6 * 2 = 12 seconds apart, to account for minor variations around 6s
    gap_idxes = np.where(differences > HEART_RATE_RECORDING_LENGTH * 2)[0]

    # get the sleeps
    sleep_idxes = []
    for lower, upper in zip(sleeps_df.startdate, sleeps_df.enddate):
        if lower < start_ts or upper > end_ts:
            continue
        # get the location in the timestamp array that is closest to `lower`
        lower_idx = np.argmin(np.abs((x_timestamp - lower) - 0))
        # get the location in the timestamp array that is closest to `upper`
        upper_idx = np.argmin(np.abs((x_timestamp - upper) - 0))

        sleep_idxes.append((lower_idx, upper_idx))

    # first, we just plot the entire thing
    plt.plot(x, y, linewidth=3, color='#85a6f7')

    # now we overlay sleeps
    for sleep_start, sleep_end in sleep_idxes:
        plt.plot(x[sleep_start:sleep_end], y[sleep_start:sleep_end], linewidth=3, color='#3E414C')

        plt.axvline(x=x.iloc[sleep_start], linestyle='--', linewidth=1.5, color='#2F303A')

        plt.axvline(x=x.iloc[sleep_end], linestyle='--', linewidth=1.5, color='#2F303A')

    # now we overlay gaps by overlaying with white
    for gap_idx in gap_idxes:
        plt.plot(x[gap_idx:gap_idx+2], y[gap_idx:gap_idx+2], linewidth=5, color='black')

    plt.ylim(40, 180)
    plt.xlim(x.iloc[0], x.iloc[-1] - pd.Timedelta(minutes=30))

    datetimes = [
        datetime.strptime(date + ' 00:00:00-0700', '%Y-%m-%d %H:%M:%S%z'),
        datetime.strptime(date + ' 04:00:00-0700', '%Y-%m-%d %H:%M:%S%z'),
        datetime.strptime(date + ' 08:00:00-0700', '%Y-%m-%d %H:%M:%S%z'),
        datetime.strptime(date + ' 12:00:00-0700', '%Y-%m-%d %H:%M:%S%z'),
        datetime.strptime(date + ' 16:00:00-0700', '%Y-%m-%d %H:%M:%S%z'),
        datetime.strptime(date + ' 20:00:00-0700', '%Y-%m-%d %H:%M:%S%z'),
        datetime.strptime(date + ' 00:00:00-0700', '%Y-%m-%d %H:%M:%S%z') + timedelta(days=1)
    ]

    plt.xticks(ticks=datetimes, labels=['12AM', '4AM', '8AM', '12PM', '4PM', '8PM', '12AM'])

    # get the y-axis ticks to appear on the right
    plt.gca().yaxis.tick_right()

    #plt.gca().grid(axis='x', color='#2F303A')
    plt.gca().grid(axis='y', color='#2F303A', which='major')

    # hide x-axis and make the xtick labels 16 size
    plt.tick_params(
        axis='x',          # changes apply to the x-axis
        which='both',      # both major and minor ticks are affected
        bottom=False,      # ticks along the bottom edge are off
        labelsize=16,
        labelcolor='silver'
    )

    # hide y-axis and make the ytick labels 16 size
    plt.tick_params(
        axis='y',          # changes apply to the x-axis
        which='both',      # both major and minor ticks are affected
        right=False,      # ticks along the bottom edge are off
        labelsize=16,
        labelcolor='silver'
    )

    # add offset to y-axis tick labels to make them appear above gridlines
    # instead of to the side
    # https://stackoverflow.com/questions/28615887/how-to-move-a-tick-label-in-matplotlib
    dx = -50/72.; dy = 15/72.
    offset = matplotlib.transforms.ScaledTranslation(dx, dy, fig.dpi_scale_trans)
    for label in plt.gca().yaxis.get_majorticklabels():
        label.set_transform(label.get_transform() + offset)


    title_text = datetime.strftime(datetime.strptime(date, '%Y-%m-%d'), '%A, %B %d')

    plt.title(title_text, fontsize=25, pad=40)

    # turn off all borders
    plt.gca().spines['left'].set_visible(False)
    plt.gca().spines['right'].set_visible(False)
    plt.gca().spines['top'].set_visible(False)
    plt.gca().spines['bottom'].set_visible(False)

../_images/notebooks_withings_scanwatch_16_0.png

Above is a plot we created ourselves!

Cool! Feel free to switch up the dates in the form as well, and see the user’s heart rate for any day data has been collected for!

4.2: Sleeps over the course of a week§

Now let’s try to reproduce the plot below, which shows when and how long the user sleeps every day of the week. This is super cool because we can look at the user’s sleeping patterns for an entire week at a glance, instead of having to sift through rows in a dataframe! We can do this by poking deeper into the Matplotlib architecture into the Artist layer and directly adding rounded rectangles to the plot.

eb1f62ec98e1409c9663a66b347068a2

The above plot is taken directly from the mobile app.

[ ]:

#@title Insert start of week (must be a Monday) and timezone name

week_start = "2022-05-23" #@param {type:"date"}
timezone_name = "America/Los_Angeles" #@param {type:"string"}

from matplotlib.ticker import FormatStrFormatter
import matplotlib.dates as dates
import matplotlib.transforms
import matplotlib.patches as patches
from matplotlib.ticker import (MultipleLocator, FormatStrFormatter,
                               AutoMinorLocator)

from dateutil import tz
from scipy.interpolate import make_interp_spline

from datetime import timezone
import pytz

def timestamp_to_hour_min(timestamp, tz_name='America/Los_Angeles'):
    # convert timezone to '%H:%M' in local timezone

    local_tz = pytz.timezone(tz_name)
    return datetime.fromtimestamp(int(timestamp)).replace(tzinfo=timezone.utc).astimezone(local_tz).strftime('%H:%M')

def hour_min_to_vert_pos(hour_min):
    num_mins = (int(hour_min.split(':')[0]) * 60 + int(hour_min.split(':')[1]))
    vert_dist = (16 * 60 - num_mins) / (16 * 60)

    return vert_dist

# measurements are taken every 10 minutes, displayed on app every 30
HEART_RATE_RECORDING_LENGTH = 30 * 60

NUM_DAYS = 7
RECT_WIDTH = 0.02
RECTS_START = 0.1
RECTS_END = 0.9

# add on timezone
timezone_offset = datetime.now(pytz.timezone(timezone_name)).strftime('%z')
week_start = week_start + timezone_offset
week_start_ts = datetime.strptime(week_start, '%Y-%m-%d%z').timestamp()

with plt.style.context('dark_background'):
    fig, ax = plt.subplots(figsize=(18,6), facecolor='black')

    for day_num in range(NUM_DAYS):
        rect_center_pos = RECTS_START + day_num / (NUM_DAYS-1) * (RECTS_END - RECTS_START)
        # zorder to ensure it is drawn behind rect
        plt.axvline(x=rect_center_pos, linestyle='--', linewidth=1.5, color='#2F303A', zorder=0)

        # rounded rectangles
        # https://stackoverflow.com/questions/58425392/bar-chart-with-rounded-corners-in-matplotlib

        # subtract 4 hours because we want to capture sleeps that start from 8pm
        # and onward
        day_ts = week_start_ts + day_num * 24 * 3600 - 4 * 3600

        row = sleeps_df[np.logical_and(sleeps_df.startdate >= day_ts, sleeps_df.enddate <= day_ts + 24 * 3600)]

        # get the one with the longest sleep, this is how withings does it
        # source: https://support.withings.com/hc/en-us/community/posts/360026177173-Naps-don-t-count-for-sleep-tracking
        try:
            row = row.iloc[np.argmax(row.enddate - row.startdate)]
        except ValueError:
            continue

        # do a shift by 4 hours to account for difference b/w
        # 8PM and 12AM, a bit of a hack
        row.startdate += 4 * 3600
        row.enddate += 4 * 3600

        start, end = timestamp_to_hour_min(row.startdate, tz_name=timezone_name), timestamp_to_hour_min(row.enddate, tz_name=timezone_name)

        top_pos, bottom_pos = hour_min_to_vert_pos(start), hour_min_to_vert_pos(end)

        if row.data['sleep_score'] < 50:
            color = '#ff7455'
        elif row.data['sleep_score'] < 75:
            color = '#FEDF00'
        else:
            color = '#24ffa4'

        rect = patches.FancyBboxPatch((rect_center_pos - RECT_WIDTH / 2, bottom_pos),
                                      RECT_WIDTH, top_pos - bottom_pos,
                                      boxstyle="round,pad=-0.0040,rounding_size=0.015",
                                      linewidth=3, fc=color, ec='none')

        ax.add_patch(rect)

    plt.xticks(ticks=np.linspace(0.1,0.9,NUM_DAYS), labels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
    plt.yticks(ticks=np.linspace(0, 1, 5), labels=['12PM', '8AM', '4AM', '12AM', '8PM'])

    # get the y-axis ticks to appear on the right
    plt.gca().yaxis.tick_right()

    plt.gca().yaxis.set_minor_locator(AutoMinorLocator(n=2))
    plt.gca().grid(axis='y', color='#2F303A', which='both')

    # hide x-axis and make the xtick labels 16 size
    plt.tick_params(
        axis='x',          # changes apply to the x-axis
        which='both',      # both major and minor ticks are affected
        bottom=False,      # ticks along the bottom edge are off
        labelsize=16,
        labelcolor='silver'
    )

    # hide y-axis and make the ytick labels 16 size
    plt.tick_params(
        axis='y',          # changes apply to the x-axis
        which='both',      # both major and minor ticks are affected
        right=False,      # ticks along the bottom edge are off
        labelsize=16,
        labelcolor='silver'
    )

    # add offset to y-axis tick labels to make them appear above gridlines
    # instead of to the side
    # https://stackoverflow.com/questions/28615887/how-to-move-a-tick-label-in-matplotlib
    dx = -50/72.; dy = 15/72.
    offset = matplotlib.transforms.ScaledTranslation(dx, dy, fig.dpi_scale_trans)
    for label in plt.gca().yaxis.get_majorticklabels():
        label.set_transform(label.get_transform() + offset)

    start_day_fmted = datetime.strftime(datetime.strptime(week_start, '%Y-%m-%d%z'), '%b %d')
    end_day_fmted = datetime.strftime(datetime.strptime(week_start, '%Y-%m-%d%z') + timedelta(days=6), '%b %d')
    plt.title(f'{start_day_fmted} - {end_day_fmted}', size=20, pad=40)

    plt.gca().set_axisbelow(True)
    # turn off all borders
    plt.gca().spines['left'].set_visible(False)
    plt.gca().spines['right'].set_visible(False)
    plt.gca().spines['top'].set_visible(False)
    plt.gca().spines['bottom'].set_visible(False)

../_images/notebooks_withings_scanwatch_18_0.png

Above is a plot we created ourselves!

Looks like we were able to pretty accurately reproduce the same things you can see in the mobile app by querying the public API!

4.3: Watch wear time§

Now what if we want to check when the user has worn the watch? For example, which days of the month has the user worn the watch, and of the days the user did not wear the watch all day, how much?

Fortunately, this idea is easy to execute with the use of `july <https://github.com/e-hulten/july/>`__, a custom library that allows us to create beautiful calendar plots. Unfortunately, since the watch only samples the heart rate extremely sparsely (roughly every 10 minutes by default to preserve battery life), then it is hard to tell exactly when the watch is on or off. As a reasonable proxy of heart rate coverage throughout a particular day, we opt to count the number of hours in that day that the watch has logged at least 3 heart rate measurements for.

Note that we considered using the absolute number of heart rate measurements in a particular day as a proxy, but this is skewed by exercise periods, where the watch records heart rate at a much much higher sampling rate.

[ ]:

#@title Specify start and end dates and extract watch usage metrics
start_date = "2022-01-01" #@param {type:"date"}
end_date = "2022-06-01" #@param {type:"date"}

dates = date_range(start_date, end_date)
data = np.zeros((len(dates), 24))

# populate data array
for idx, date in tqdm(list(enumerate(dates))):
    local_tz = pytz.timezone('America/Los_Angeles')

    day_start = pd.Timestamp(date).replace(tzinfo=local_tz)

    for i in range(24):
        hour_start, hour_end = day_start + pd.Timedelta(hours=i), day_start + pd.Timedelta(hours=i+1)

        # check for number of measurements during this time
        bool_arr = np.logical_and(datetime.fromtimestamp(int(hour_start.timestamp())) < hr_df.datetime,
               hr_df.datetime < datetime.fromtimestamp(int(hour_end.timestamp())))

        num_measurements = len(np.where(bool_arr)[0])

        if num_measurements >= 3:
            data[idx, i] = 1

# A couple TODO's for modifying july to make it fit our usage better
# TODO 1: apply patch to this line https://github.com/e-hulten/july/blob/953ef3ee20549ca791d88d2f08e54862aa4c2f58/src/july/helpers.py#L118
# to make the fontsize for text values smaller
# TODO 2: apply patch to this line https://github.com/e-hulten/july/blob/master/src/july/plot.py#L149
# to make the weeknum_labels refer to the month day number instead

Now we make the calendar plot! The above cell calculated data. data is a 2D numpy array with axes (day, hour), where the first axis refers to a particular day and the second axis refers to a particular hour of the day. The value for, say data[130,8], is exactly one if the user wore his watch from 8am to 9am on day 130 (which is 130 days after start_date as specified above) and otherwise zero.

[ ]:

# sum across hour axis
july.calendar_plot(dates, data.sum(axis=1), value_label=True, weeknum_label=False, title=False)
plt.suptitle('Watch usage', fontsize="x-large", y=1.03)
plt.show()  # suppress output from above line

../_images/notebooks_withings_scanwatch_22_0.png

As we can see, our user has been fairly consistent about wearing his watch. He’s worn his watch for nearly every hour of the day every day.

If we want to check when the user has failed to wear his watch, we can simply look at the data array we populated. First, we get the start and end indices for watch measurements.

[ ]:

start_idx = np.where(data.sum(axis=1) > 0)[0][0]
end_idx = np.where(data.sum(axis=1) > 0)[0][-1]

print(f'Data collection started at index {start_idx}')
print(f'Data collection last performed at index {end_idx}')

Data collection started at index 138
Data collection last performed at index 151

Then, we exclude the first and last indices (because they have gone for half a day, so may skew the data) and look at which hours the user tends to miss for an average day of full usage.

[ ]:

def stylize_hour_num(hour):
    return datetime.strftime(datetime.strptime(str(hour), '%H'), ' %I %p').replace(' 0','')

with plt.style.context('seaborn'):
    plt.figure(figsize=(14,6))
    plt.bar(np.arange(24), data[start_idx+1:end_idx].mean(axis=0))

    plt.title('Rate at which user wears watch at a particular hour of the day',
              pad=20, fontsize=15)

    plt.xticks(ticks=np.arange(24), labels=[stylize_hour_num(i) for i in range(24)], fontsize=8)

    plt.ylabel('Wear rate')
    plt.xlabel('Hour of the day')

../_images/notebooks_withings_scanwatch_26_0.png

As we can see, though the user sometimes takes his watch off, he still wears it 80% of the time in the worst case.

5. Data Analysis§

Data isn’t much without some analysis, so we’re going to do some in this section.

DISCLAIMER: the analyses below may not be 100% biologically or scientifically grounded; the code is here to assist in your process, if you are interested in asking these kinds of questions.

5.1: Heart rate vs. sleep period length§

Maybe the heart rate is correlated with how long a particular sleep period was. Let’s see if this hypothesis is true.

[ ]:

#@title Set date range and timezone
start = "2020-01-01" #@param {type:"date"}
end = "2022-05-22" #@param {type:"date"}
timezone = "US/Pacific" #@param {type:"string"}
params_all = {
    'start': f'{start}T00:00:00.000Z',
    'end': f'{end}T00:00:00.000Z'
}

First we get the length of each sleep in hours.

[ ]:

sleeps_df['Length (hours)'] = (sleeps_df.enddate - sleeps_df.startdate) / 3600

Next we get the median heart rate for each sleep.

[ ]:

measurement_timestamps = hr_df.datetime.apply(lambda x: x.timestamp())

all_heart_rates = []
median_heart_rates = []

for sleep_start, sleep_end in zip(sleeps_df.startdate, sleeps_df.enddate):
    idxes = np.where(np.logical_and(measurement_timestamps > sleep_start, measurement_timestamps < sleep_end))[0]

    heart_rates = np.array(hr_df.iloc[idxes].heart_rate)

    all_heart_rates.append(heart_rates)
    median_heart_rates.append(np.median(heart_rates))

sleeps_df['Median heart rate'] = median_heart_rates

/usr/local/lib/python3.7/dist-packages/numpy/core/fromnumeric.py:3441: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/usr/local/lib/python3.7/dist-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)

Let’s make a quick plot to get some intuition. Here we just use seaborn, as it’s very quick to get beautiful plots out with minimal effort.

[ ]:

with plt.style.context('seaborn'):
    p = sns.jointplot(x='Length (hours)', y='Median heart rate', data=sleeps_df, kind='reg')

../_images/notebooks_withings_scanwatch_34_0.png

As we can see from the scatterplot above, it looks like there might be a correlation there. Let’s compute $R^2$ just to see exactly how correlated.

We’ll follow this documentation and perform a linear regression to obtain the coefficient of determination.

[ ]:

from scipy import stats

# eliminate any nan's
filtered_sleeps = sleeps_df[~np.isnan(sleeps_df['Median heart rate'])]

slope, intercept, r_value, p_value, std_err = stats.linregress(filtered_sleeps['Length (hours)'],
                                                               filtered_sleeps['Median heart rate'])

print(f'Slope: {slope:.3g}')
print(f'Coefficient of determination: {r_value**2:.3g}')
print(f'p-value: {p_value:.3g}')

Slope: -0.645
Coefficient of determination: 0.849
p-value: 5.72e-08

We also see that the p-value, which is determined by scipy to be the two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero, is significant ($<$0.05).

So given this evidence from this particular data, maybe length of a sleep period is correlated with your heart rate.

5.2: Outlier Detection§

However, let us suppose that there is an outlier here. We’ll manually inject one, as one is currently not really present in the data, just to demo that we can deal with outliers here.

[ ]:

import copy
outlier = copy.deepcopy(sleeps_df.iloc[0])
outlier['Length (hours)'] = 8
outlier['Median heart rate'] = 120

filtered_sleeps = filtered_sleeps.append(outlier, ignore_index=True)

Now that we know that there is at least one outlier in this dataset, let’s implement a system based on a normal distribution to automatically detect and remove outliers like this, based on the method described here.

[ ]:

from scipy.stats import multivariate_normal

# calculate the covariance matrix
data = np.stack((filtered_sleeps['Length (hours)'], filtered_sleeps['Median heart rate']), axis=0)
covariance_matrix = np.cov(data)

# calculating the mean
mean_values = [np.mean(filtered_sleeps['Length (hours)']),np.mean(filtered_sleeps['Median heart rate'])]

# multivariate normal distribution
model = multivariate_normal(cov=covariance_matrix,mean=mean_values)
data = np.stack((filtered_sleeps['Length (hours)'], filtered_sleeps['Median heart rate']),axis=1)

# finding the outliers
# any point with a probability lower than the threshold value is considered an outlier and removed
threshold = 1.0e-05
outlier = model.pdf(data).reshape(-1) < threshold

new_data = copy.deepcopy(data)

for i, boolean in enumerate(outlier):
  if boolean:
    print(data[i], " is an outlier")
    print(np.where(np.logical_and(data[:,0] == data[i,0], data[:,1] == data[i,1]))[0].item(0))
    # delete outliers
    new_data = np.delete(new_data, np.where(np.logical_and(data[:,0] == data[i,0], data[:,1] == data[i,1]))[0].item(0),axis=0)

# plot new graph with outliers removed
# new_data[:,0] correspond to the new lengths, and new_data[:,1] correspond to the new values
p = sns.jointplot(x=new_data[:,0], y=new_data[:,1], kind='reg')

[  8. 120.]  is an outlier
18

../_images/notebooks_withings_scanwatch_40_1.png

As we can see, our outlier detection system found the outlier and excluded it from our dataset, as needed.