Google Fit: Guide to data extraction and analysis§


A picture of the Google Fit Mobile Application

Google Fit is Google’s version of Apple Health. It lets you track your fitness activity and health data from all of your wearable devices like Apple Watches, Samsung Galaxy Watches, Polar Smartwatches, etc.

Google Fit is completely free. It also comes preloaded on Android Wear watches and can be downloaded from the Apple App and Google Play stores.

We’ve been using the Google Fit application for the past few weeks, and we will show you how to extract its data, visualize your activities, and compute correlations between multiple metrics. The Google Fit API lets users extract all kinds of workout and health data. However, for this notebook, we will focus on the participant’s daily summaries and activities, such as steps, heart rate, and workouts.

We will be able to extract the following parameters:

| Parameter Name | Sampling Frequency |
|---|---|
| Sleep Duration | Daily |
| Reproductive Health (Period Flow) | Daily |
| Move Minutes | Daily |
| Speed | Daily |
| Energy Expended | Daily |
| Blood Glucose | Per Minute |
| Oxygen Saturation | Per Minute |
| Steps | Per Minute |
| Blood Pressure | Per Minute |
| Body Temperature | Per Minute |
| Calories Consumed | Per Minute |
| Heart Rate | Every 5 seconds |

In this guide, we sequentially cover the following nine topics to extract data from Google Fit servers:

  1. Set up

  2. Authentication/Authorization

    • Requires only an access token; no OAuth client setup of your own.

  3. Data extraction

    • We get data via wearipedia in a couple lines of code.

  4. Data Exporting

    • We export all of this data to file formats compatible with R, Excel, and MATLAB.

  5. Adherence

    • We simulate non-adherence by dynamically removing datapoints from our simulated data.

  6. Visualization

    • We create a simple plot to visualize our data.

  7. Advanced visualization

    • 7.1 Visualizing participant’s Weekly Step Activity!

    • 7.2 Visualizing participant’s Weekly Heart Activity!

    • 7.3 Visualizing participant’s Detailed Heart Rate Breakdown!

  8. Data Analysis

    • 8.1 Analyzing correlation between Heart Rate and Number of Steps!

  9. Outlier Detection

    • 9.1 Highlighting Outliers!

Disclaimer: this notebook is purely for educational purposes. All of the data currently stored in this notebook is purely synthetic, meaning randomly generated according to rules we created. Despite this, the end-to-end data extraction pipeline has been tested on our own data, meaning that if you enter your own access token on your own Colab instance, you can visualize your own real data. That being said, we were unable to thoroughly test the timezone functionality, since we only have one account, so beware.

Before starting, note that the Google Fit access token necessary to extract data only lasts for 1 hour. Thus, the researcher should fetch the data as soon as the participant provides the token.
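Because of this one-hour window, it can help to note when the participant generated the token and check how much time is left before starting a long extraction. The sketch below only does the arithmetic; the helper names (token_deadline, seconds_remaining) are our own and not part of wearipedia, and the one-hour lifetime is simply the figure stated above.

```python
from datetime import datetime, timedelta, timezone

# Lifetime stated in this guide; adjust if your token reports a different value.
TOKEN_LIFETIME = timedelta(hours=1)


def token_deadline(issued_at, lifetime=TOKEN_LIFETIME):
    """Return the moment after which the token should be treated as expired."""
    return issued_at + lifetime


def seconds_remaining(issued_at, now=None, lifetime=TOKEN_LIFETIME):
    """Return the seconds left before expiry (0.0 if already expired)."""
    if now is None:
        now = datetime.now(timezone.utc)
    remaining = (token_deadline(issued_at, lifetime) - now).total_seconds()
    return max(0.0, remaining)


# Hypothetical example: the participant created the token at 09:00 UTC
issued = datetime(2022, 3, 1, 9, 0, tzinfo=timezone.utc)
check_time = datetime(2022, 3, 1, 9, 45, tzinfo=timezone.utc)
print(seconds_remaining(issued, now=check_time))  # 900.0 (15 minutes left)
```

If the remaining time is short, it is usually simpler to ask the participant for a fresh token than to risk a partial extraction.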

1. Setup§

Participant Setup§

Dear Participant,

Once you download the Google Fit app, please set it up by following these resources:

  • Written guide: https://www.businessinsider.com/guides/tech/google-fit

  • Video guide: https://www.youtube.com/watch?v=0GnBgqnRM60&ab_channel=UponTop

Make sure that your phone is logged in to the Google Fit app using the Google Fit login credentials (email and password) given to you by the data receiver.

Best,

Wearipedia

Data Receiver Setup§

Please follow the below steps:

  1. Create an email address for the participant, for example foo@email.com.

  2. Create a Google Fit account with the email foo@email.com and a random password.

  3. Keep foo@email.com and the password stored somewhere safe.

  4. Distribute the device to the participant and instruct them to follow the participant setup letter above.

  5. Go to the OAuth 2.0 Playground: https://developers.google.com/oauthplayground/

  6. Choose Fitness API v1 in the ‘Select & authorize APIs’ menu.

  7. Select all the data types that the study needs access to and click ‘Authorize APIs’. Your clinical study coordinator might have more details regarding what sorts of health data they require.

  8. Log in with the account that the participant has connected to Google Fit.

  9. Click on ‘Continue’.

  10. In Step 2, click on ‘Exchange authorization code for tokens’.

  11. Copy the access token from the OAuth 2.0 Playground and paste it into the box in Section 2.

  12. Install the wearipedia Python package to easily extract data from this device via the Google Fit API.

[1]:
!pip install wearipedia
!pip install openpyxl
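Step 7 of the data receiver setup asks you to pick data types in the OAuth 2.0 Playground. As a rough aid, the sketch below maps the data types extracted later in this notebook to the read-only Fitness API scopes we believe cover them. The mapping and the helper name scopes_needed are our own assumptions rather than part of wearipedia, so please check them against the scope list shown in the Playground before relying on them.

```python
# Common prefix of the Google Fitness OAuth scopes
FITNESS_SCOPE_PREFIX = "https://www.googleapis.com/auth/fitness."

# Assumed mapping from this notebook's data type names to scope families
SCOPE_FOR_DATA_TYPE = {
    "steps": "activity",
    "heart_minutes": "activity",
    "calories_expended": "activity",
    "activity_minutes": "activity",
    "heart_rate": "heart_rate",
    "sleep": "sleep",
    "blood_pressure": "blood_pressure",
    "blood_glucose": "blood_glucose",
    "body_temperature": "body_temperature",
    "oxygen_saturation": "oxygen_saturation",
    "menstruation": "reproductive_health",
    "height": "body",
    "weight": "body",
    "speed": "location",
    "distance": "location",
}


def scopes_needed(data_types):
    """Return the sorted, de-duplicated read scopes for the given data types."""
    return sorted(
        {FITNESS_SCOPE_PREFIX + SCOPE_FOR_DATA_TYPE[t] + ".read" for t in data_types}
    )


for scope in scopes_needed(["steps", "heart_rate", "weight"]):
    print(scope)
```

Requesting only the scopes a study needs keeps the consent screen shorter for the participant.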

2. Authentication/Authorization§

To obtain access to data, authorization is required. All you’ll need to do here is enter the access token for your Google Fit account. We’ll use this access token to extract the data in the sections below.

Google Fit collects recorded activities from external devices, and the participant must provide an access token so that we can call Google’s API and read fitness data from their account.

[2]:
#@title Enter the Participant's Access Token

google_auth_code = '4/0AVHEtk5zeKpZ1vH0v8iVBoxK9BFrG3k72Y7Ce61mEv_kDi7e5uzm2CmS8incOAbP8pXbvQ' #@param {type:"string"}
google_access_token = 'ya29.a0Ael9sCO_HuqZGAgii5Z5EqFQ0_GxI1D3vQsj5g1TGZsRsnY-s4FuaVWB8sB28uxTrYvJIAAAEpp4oJSSiYmYVMmC8jiIxm2FxgE6hyQQgijoe0JZVAhLg1FzmE8oODAZm1t3DvHVA9QxIta_NFZ4RnbjF0R8aCgYKAR8SARESFQF4udJhOejZSWJt-DccLKr94OaP5A0163'
print('Authorization Code: '+google_auth_code)
Authorization Code: 4/0AVHEtk5zeKpZ1vH0v8iVBoxK9BFrG3k72Y7Ce61mEv_kDi7e5uzm2CmS8incOAbP8pXbvQ

3. Data Extraction§

Data can be extracted via wearipedia, our open-source Python package that unifies dozens of complex wearable device APIs into one simple, common interface.

First, we’ll set a date range and then extract all of the data within that date range. You can select whether you would like synthetic data or not with the checkbox.

[17]:
#@title Enter start and end dates (in the format yyyy-mm-dd)

#set start and end dates - this will give you all the data from 2000-01-01 (January 1st, 2000) to 2100-02-03 (February 3rd, 2100), for example
start_date='2022-03-01' #@param {type:"string"}
end_date='2022-04-17' #@param {type:"string"}
synthetic = True #@param {type:"boolean"}
[18]:
import wearipedia

device = wearipedia.get_device("google/googlefit")

if not synthetic:
    device.authenticate({"authorization_code": google_auth_code})

params = {"start_date": start_date, "end_date": end_date}

steps = device.get_data("steps", params=params)
heart_rate = device.get_data("heart_rate", params=params)
sleep = device.get_data("sleep", params=params)
heart_minutes = device.get_data("heart_minutes", params=params)
blood_pressure = device.get_data("blood_pressure", params=params)
blood_glucose = device.get_data("blood_glucose", params=params)
body_temperature = device.get_data("body_temperature", params=params)
calories_expended = device.get_data("calories_expended", params=params)
activity_minutes = device.get_data("activity_minutes", params=params)
height = device.get_data("height", params=params)
oxygen_saturation = device.get_data("oxygen_saturation", params=params)
menstruation = device.get_data("menstruation", params=params)
speed = device.get_data("speed", params=params)
weight = device.get_data("weight", params=params)
distance = device.get_data("distance", params=params)
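For readers curious about what a direct call might look like without wearipedia, the sketch below builds (but does not send) a request to the Fitness REST API's aggregate endpoint with daily buckets. This is not wearipedia's implementation, which is not shown in this notebook. The helper names are hypothetical, and treating each date as midnight UTC is an assumption; recall that timezone handling was not thoroughly tested here.

```python
import json
from datetime import datetime, timezone

AGGREGATE_URL = "https://www.googleapis.com/fitness/v1/users/me/dataset:aggregate"
DAY_MS = 24 * 60 * 60 * 1000  # one day in milliseconds


def to_millis(day):
    """Convert 'yyyy-mm-dd' to epoch milliseconds, assuming midnight UTC."""
    dt = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


def build_aggregate_request(access_token, data_type_name, start_date, end_date):
    """Return (url, headers, body) for a daily-bucket aggregate query."""
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
    }
    body = {
        "aggregateBy": [{"dataTypeName": data_type_name}],
        "bucketByTime": {"durationMillis": DAY_MS},
        "startTimeMillis": to_millis(start_date),
        "endTimeMillis": to_millis(end_date),
    }
    return AGGREGATE_URL, headers, json.dumps(body)


url, headers, body = build_aggregate_request(
    "PASTE_ACCESS_TOKEN_HERE",      # placeholder, not a real token
    "com.google.step_count.delta",  # Google Fit data type name for step counts
    "2022-03-01",
    "2022-03-02",
)
print(url)
print(body)
```

The returned triple could be passed to any HTTP client as a POST request; using wearipedia as in the cell above avoids having to manage these details per data type.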

4. Data Exporting§

In this section, we export all of this data to formats compatible with popular scientific computing software (R, Excel, Google Sheets, Matlab). Specifically, we will first export to JSON, which can be read by R and Matlab. Then, we will export to CSV, which can be consumed by Excel, Google Sheets, and every other popular programming language.

Exporting to JSON (R, Matlab, etc.)§

Exporting to JSON is fairly simple. We export each data type to its own file and also export a complete version that combines all of them.

[20]:
import json

# Collect every data type under its name so each can be exported in one loop
complete = {
    "steps": steps,
    "heart_rate": heart_rate,
    "sleep": sleep,
    "heart_minutes": heart_minutes,
    "blood_pressure": blood_pressure,
    "blood_glucose": blood_glucose,
    "body_temperature": body_temperature,
    "calories_expended": calories_expended,
    "activity_minutes": activity_minutes,
    "oxygen_saturation": oxygen_saturation,
    "height": height,
    "menstruation": menstruation,
    "speed": speed,
    "weight": weight,
    "distance": distance,
}

# One JSON file per data type; the with-block closes each file after writing
for name, data in complete.items():
    with open(f"{name}.json", "w") as f:
        json.dump(data, f)

# A single file that contains all data types
with open("complete.json", "w") as f:
    json.dump(complete, f)

Feel free to open the file viewer (see left pane) to look at the outputs!
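To confirm that an export round-trips cleanly, you can load a file back and count the records per data type. The sketch below uses a tiny stand-in dictionary written to a temporary directory so that it runs on its own; load_export and summarize are illustrative names, not wearipedia functions.

```python
import json
import os
import tempfile


def load_export(path):
    """Read a JSON export back into Python objects."""
    with open(path) as f:
        return json.load(f)


def summarize(complete):
    """Return the number of records stored under each data type."""
    return {name: len(records) for name, records in complete.items()}


# Stand-in for complete.json so the example is self-contained
demo = {"steps": [{"startTimeMillis": 0}], "sleep": []}
demo_path = os.path.join(tempfile.mkdtemp(), "complete.json")
with open(demo_path, "w") as f:
    json.dump(demo, f)

print(summarize(load_export(demo_path)))  # {'steps': 1, 'sleep': 0}
```

Pointing load_export at the real complete.json gives a quick sanity check that no data type came back empty unexpectedly.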

Exporting to CSV and XLSX (Excel, Google Sheets, R, Matlab, etc.)§

Exporting to CSV/XLSX requires a bit more processing, since they enforce a pretty restrictive schema.

We will thus export each data type as a separate file.

[21]:
import pandas as pd

steps_df = pd.DataFrame.from_dict(steps)
steps_df.to_csv('steps.csv')
steps_df.to_excel('steps.xlsx')

heart_rate_df = pd.DataFrame.from_dict(heart_rate)
heart_rate_df.to_csv('heart_rate.csv')
heart_rate_df.to_excel('heart_rate.xlsx')

sleep_df = pd.DataFrame.from_dict(sleep)
sleep_df.to_csv('sleep.csv')
sleep_df.to_excel('sleep.xlsx')

heart_minutes_df = pd.DataFrame.from_dict(heart_minutes)
heart_minutes_df.to_csv('heart_minutes.csv')
heart_minutes_df.to_excel('heart_minutes.xlsx')

blood_pressure_df = pd.DataFrame.from_dict(blood_pressure)
blood_pressure_df.to_csv('blood_pressure.csv')
blood_pressure_df.to_excel('blood_pressure.xlsx')

blood_glucose_df = pd.DataFrame.from_dict(blood_glucose)
blood_glucose_df.to_csv('blood_glucose.csv')
blood_glucose_df.to_excel('blood_glucose.xlsx')

body_temperature_df = pd.DataFrame.from_dict(body_temperature)
body_temperature_df.to_csv('body_temperature.csv')
body_temperature_df.to_excel('body_temperature.xlsx')

calories_expended_df = pd.DataFrame.from_dict(calories_expended)
calories_expended_df.to_csv('calories_expended.csv')
calories_expended_df.to_excel('calories_expended.xlsx')

activity_minutes_df = pd.DataFrame.from_dict(activity_minutes)
activity_minutes_df.to_csv('activity_minutes.csv')
activity_minutes_df.to_excel('activity_minutes.xlsx')

oxygen_saturation_df = pd.DataFrame.from_dict(oxygen_saturation)
oxygen_saturation_df.to_csv('oxygen_saturation.csv')
oxygen_saturation_df.to_excel('oxygen_saturation.xlsx')

height_df = pd.DataFrame.from_dict(height)
height_df.to_csv('height.csv')
height_df.to_excel('height.xlsx')

menstruation_df = pd.DataFrame.from_dict(menstruation)
menstruation_df.to_csv('menstruation.csv')
menstruation_df.to_excel('menstruation.xlsx')

speed_df = pd.DataFrame.from_dict(speed)
speed_df.to_csv('speed.csv')
speed_df.to_excel('speed.xlsx')

weight_df = pd.DataFrame.from_dict(weight)
weight_df.to_csv('weight.csv')
weight_df.to_excel('weight.xlsx')

distance_df = pd.DataFrame.from_dict(distance)
distance_df.to_csv('distance.csv')
distance_df.to_excel('distance.xlsx')

Again, feel free to look at the output files and download them.
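The files above keep each record as a nested structure inside a single cell. If you would rather have one tidy row per day, the records can be flattened first. The sketch below assumes each record is a dictionary with the 'startTimeMillis' and 'point' fields that the plotting code later in this notebook reads; the function names are hypothetical, and dates are computed in UTC here as a simplifying assumption.

```python
import csv
import io
from datetime import datetime, timezone


def flatten_record(record, value_key="fpVal"):
    """Turn one nested record into a {'date', 'value'} row."""
    millis = int(record["startTimeMillis"])
    day = datetime.fromtimestamp(millis / 1000, tz=timezone.utc).strftime("%Y-%m-%d")
    try:
        value = record["point"][0]["value"][0][value_key]
    except (KeyError, IndexError):
        value = None  # no reading was recorded for this day
    return {"date": day, "value": value}


def records_to_csv(records, value_key="fpVal"):
    """Return CSV text with one row per record."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["date", "value"], lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(flatten_record(record, value_key))
    return buf.getvalue()


# Two made-up records: one with a step count, one with no data points
sample = [
    {"startTimeMillis": 0, "point": [{"value": [{"intVal": 5}]}]},
    {"startTimeMillis": 86400000, "point": []},
]
print(records_to_csv(sample, value_key="intVal"))
```

Missing readings become empty cells, which Excel, R, and MATLAB all treat as missing values on import.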

5. Adherence§

The device simulator already randomly deletes small chunks of each day. In this section, we will simulate non-adherence by the participant over longer periods of time (day-level and week-level).

Then, we will detect this non-adherence and give a Pandas DataFrame that concisely describes when the participant has had their device on and off throughout the entirety of the time period, allowing you to calculate how long they’ve had it on/off etc.

We will first delete a certain % of blocks either at the day level or week level, with user input.

[22]:
#@title Non-adherence simulation
block_level = "day" #@param ["day", "week"]
adherence_percent = 0.89 #@param {type:"slider", min:0, max:1, step:0.01}
[23]:
import numpy as np

if block_level == "day":
    block_length = 1
elif block_level == "week":
    block_length = 7

# This function randomly removes blocks of datapoints (days or weeks)
# from the data we have received from Google Fit, based on
# adherence_percent

def AdherenceSimulator(data):

    num_blocks = len(data) // block_length
    num_blocks_to_keep = int(adherence_percent * num_blocks)
    idxes = np.random.choice(np.arange(num_blocks), replace=False,
                             size=num_blocks_to_keep)

    adhered_data = []

    # Keep every datapoint of each selected block, in chronological order
    for i in sorted(int(idx) for idx in idxes):
        start = i * block_length
        end = (i + 1) * block_length
        adhered_data.extend(data[start:end])

    return adhered_data


# Adding adherence for all our datapoints
steps = AdherenceSimulator(steps)
heart_rate = AdherenceSimulator(heart_rate)
sleep = AdherenceSimulator(sleep)
heart_minutes = AdherenceSimulator(heart_minutes)
blood_pressure = AdherenceSimulator(blood_pressure)
blood_glucose = AdherenceSimulator(blood_glucose)
body_temperature = AdherenceSimulator(body_temperature)
calories_expended = AdherenceSimulator(calories_expended)
activity_minutes = AdherenceSimulator(activity_minutes)
oxygen_saturation = AdherenceSimulator(oxygen_saturation)
height = AdherenceSimulator(height)
menstruation = AdherenceSimulator(menstruation)
speed = AdherenceSimulator(speed)
weight = AdherenceSimulator(weight)
distance = AdherenceSimulator(distance)

And now we have significantly fewer datapoints! This will give us a more realistic situation, where participants may take off their device for days or weeks at a time.

Now let’s detect non-adherence. We first rebuild a Pandas DataFrame for each data type from the thinned data, so that each can be checked at daily resolution.

[24]:
steps_df = pd.DataFrame.from_dict(steps)
heart_rate_df = pd.DataFrame.from_dict(heart_rate)
sleep_df = pd.DataFrame.from_dict(sleep)
heart_minutes_df = pd.DataFrame.from_dict(heart_minutes)
blood_pressure_df = pd.DataFrame.from_dict(blood_pressure)
blood_glucose_df = pd.DataFrame.from_dict(blood_glucose)
body_temperature_df = pd.DataFrame.from_dict(body_temperature)
calories_expended_df = pd.DataFrame.from_dict(calories_expended)
activity_minutes_df = pd.DataFrame.from_dict(activity_minutes)
oxygen_saturation_df = pd.DataFrame.from_dict(oxygen_saturation)
height_df = pd.DataFrame.from_dict(height)
mensuration_df = pd.DataFrame.from_dict(menstruation)
speed_df = pd.DataFrame.from_dict(speed)
weight_df = pd.DataFrame.from_dict(weight)
distance_df = pd.DataFrame.from_dict(distance)
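The detection step described above can be sketched with the standard library alone: given the study window and the set of days that actually have data, mark each day as worn or not worn, then summarize the gaps. The function names below are illustrative and not part of wearipedia, and the recorded days are made up for the example.

```python
from datetime import date, timedelta


def daily_adherence(start, end, recorded_days):
    """Return a list of (day, worn) pairs covering start..end inclusive."""
    recorded = set(recorded_days)
    rows = []
    day = start
    while day <= end:
        rows.append((day, day in recorded))
        day += timedelta(days=1)
    return rows


def adherence_rate(rows):
    """Fraction of days with data (0.0 for an empty window)."""
    return sum(worn for _, worn in rows) / len(rows) if rows else 0.0


def gaps(rows):
    """Return (first_missing_day, last_missing_day) for each run of missing days."""
    runs, run_start, prev = [], None, None
    for day, worn in rows:
        if not worn and run_start is None:
            run_start = day
        if worn and run_start is not None:
            runs.append((run_start, prev))
            run_start = None
        prev = day
    if run_start is not None:
        runs.append((run_start, prev))
    return runs


# Made-up example: data exists on 1, 2, 5 and 7 March 2022
recorded = [date(2022, 3, d) for d in (1, 2, 5, 7)]
rows = daily_adherence(date(2022, 3, 1), date(2022, 3, 7), recorded)
print(adherence_rate(rows))
print(gaps(rows))
```

The (day, worn) pairs can be passed straight to pd.DataFrame(rows, columns=["date", "worn"]) if you prefer to continue in pandas.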

We can plot this out to see adherence at one-day frequency throughout the entire data collection period. For this chart, we plot the participant’s weight over the time period, taken from the weight DataFrame.

[25]:
import matplotlib.pyplot as plt
import datetime

dates = pd.date_range(start_date,end_date)

energy = []

def datacleanup(dataset):
    # Flatten the raw records into one (date, value) row per entry

    df = pd.DataFrame()

    for i in range(len(dataset)):
        # Each row keeps its record in column 0; timestamps are in milliseconds
        milliseconds = dataset.iloc[i].get(0)['startTimeMillis']
        date = datetime.datetime.fromtimestamp(milliseconds/1000.0)
        try:
            df = pd.concat([df,pd.DataFrame.from_dict([{
                'date':str(date)[:10],
                'value':dataset.iloc[i].get(0)['point'][0]['value'][0]['fpVal']
            }])])
        except Exception:
            # No reading for this entry: keep the date and mark the value missing
            df = pd.concat([df,pd.DataFrame.from_dict([{
                'date':str(date)[:10],
                'value':None
            }])])

    return df

weights = datacleanup(weight_df)

for d in dates:
    res = weights[weights.get('date')==str(d)[:10]]
    if len(res) == 0:
        energy.append(None)
    else:
        energy.append(res.iloc[0].value)

plt.figure(figsize=(12, 6))
plt.plot(dates, energy)
plt.show()
../_images/notebooks_google_fit_26_0.png

6. Visualization§

We’ve extracted lots of data, but what does it look like?

In this section, we will be visualizing two kinds of data (heart minutes and calories expended) in a simple, customizable plot! This plot is intended to provide a starter example for plotting, whereas later examples emphasize deep control and aesthetics.

[26]:
#@title Basic Plot
feature = "Calories Expended" #@param ['Heart Minutes', 'Calories Expended']
start_date = "2022-03-04" #@param {type:"date"}
time_interval = "full time" #@param ["one week", "full time"]
smoothness = 0.02 #@param {type:"slider", min:0, max:1, step:0.01}
smooth_plot = True #@param {type:"boolean"}

import matplotlib.dates as mdates
import matplotlib.pyplot as plt

start_date = datetime.datetime.strptime(start_date, '%Y-%m-%d')

if time_interval == "one week":
    day_idxes = [i for i,d in enumerate(dates) if d >= start_date and d <= start_date + datetime.timedelta(days=7)]
    end_date = start_date + datetime.timedelta(days=7)
elif time_interval == "full time":
    day_idxes = [i for i,d in enumerate(dates) if d >= start_date]
    end_date = dates[-1]

if feature == "Heart Minutes":
    hm = datacleanup(heart_minutes_df)
    concat_hm = []
    for i,d in enumerate(dates):
        day = d.strftime('%Y-%m-%d')
        if i in day_idxes:
          heart = hm[hm['date']==day]
          if len(heart) != 0:
              concat_hm += [(day,heart.iloc[0].value)]
          else:
              concat_hm += [(day,None)]
    ts = [x[0] for x in concat_hm]

    day_arr = [x[1] for x in concat_hm]

    sigma = 200 * smoothness

    title_fillin = "Heart Minutes"


if feature == "Calories Expended":
    ce = datacleanup(calories_expended_df)
    concat_data = []
    for i,d in enumerate(dates):
        day = d.strftime('%Y-%m-%d')
        if i in day_idxes:
          cals = ce[ce['date']==day]
          if len(cals) != 0:
              concat_data += [(day,cals.iloc[0].value)]
          else:
              concat_data += [(day,None)]

    ts = [x[0] for x in concat_data]

    day_arr = [x[1] for x in concat_data]

    sigma = 200 * smoothness

    title_fillin = "Calories Expended"


with plt.style.context('ggplot'):
    fig, ax = plt.subplots(figsize=(15, 8))

    if smooth_plot:
        def to_numpy(day_arr):
            arr_nonone = [x for x in day_arr if x is not None]
            mean_val = int(np.mean(arr_nonone))
            for i,x in enumerate(day_arr):
                if x is None:
                    day_arr[i] = mean_val

            return np.array(day_arr)

        none_idxes = [i for i,x in enumerate(day_arr) if x is None]
        day_arr = to_numpy(day_arr)
        from scipy.ndimage import gaussian_filter
        day_arr = list(gaussian_filter(day_arr, sigma=sigma))
        for i, x in enumerate(day_arr):
            if i in none_idxes:
                day_arr[i] = None

    plt.plot(ts, day_arr)
    start_date_str = start_date.strftime('%Y-%m-%d')
    end_date_str = end_date.strftime('%Y-%m-%d')
    plt.title(f"{title_fillin} from {start_date_str} to {end_date_str}",
              fontsize=20)
    plt.xlabel("Date")
    plt.xticks(ts[::max(1, len(ts) // 8)])  # step of at least 1, even for short ranges
    plt.ylabel(title_fillin)
../_images/notebooks_google_fit_28_0.png

This plot allows you to quickly scan your data at different time scales (one week and full time) and for different kinds of measurements (heart minutes and calories expended), which enables easy and fast data exploration.

Furthermore, the smoothness parameter makes it easy to look for patterns in long-term trends.
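To make the effect of the smoothness setting more concrete, here is a small standard-library sketch of Gaussian smoothing over a daily series with missing days. Note that it uses a different technique from the cell above: instead of filling gaps with the mean and calling scipy's gaussian_filter, it averages only over the days that have values (a normalized convolution) and leaves missing days as None. The function name and the 3-sigma window are our own choices.

```python
import math


def gaussian_smooth(values, sigma):
    """Smooth a list of numbers with a Gaussian kernel, skipping None entries."""
    if sigma <= 0:
        return list(values)  # no smoothing requested
    radius = max(1, int(3 * sigma))  # weights beyond 3 sigma are negligible
    smoothed = []
    for i, v in enumerate(values):
        if v is None:
            smoothed.append(None)  # keep gaps visible in the plot
            continue
        num = den = 0.0
        for j in range(max(0, i - radius), min(len(values), i + radius + 1)):
            if values[j] is None:
                continue
            w = math.exp(-0.5 * ((j - i) / sigma) ** 2)
            num += w * values[j]
            den += w
        smoothed.append(num / den)
    return smoothed


# Made-up daily values with one missing day
series = [2000, 2400, None, 1800, 2600, 2200]
print(gaussian_smooth(series, sigma=1.0))
```

Larger sigma values pull each day closer to the average of its neighbors, which is what makes long-term trends easier to see.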

7. Advanced Visualization§

Now we’ll do some more advanced plotting that at times features hardcore matplotlib hacking with the benefit of aesthetic quality.

7.1 Visualizing participant’s Weekly Step Activity!§

Let’s say you were interested in knowing how many steps you take in a day. If you had an iPhone, you could open Apple Health and check your step count, which is approximated using just your iPhone’s built-in accelerometer. Apple Health displays this as a Weekly Steps chart. Let’s recreate it for your choice of week using the data that we have fetched from the Google Fit API!

Below, input the desired start and end dates for the plot above.

[30]:
#@title Set date range for the chart above

start = "2022-03-01" #@param {type:"date"}
end = "2022-03-07" #@param {type:"date"}

from datetime import  date

# A Dictionary to save the list of all the dates between end and start dates
step_plot_dates= {}

# Saving the end and start dates in a date format from the inputted strings
step_plot_start_date = date(int(start.split('-')[0]),int(start.split('-')[1]),
                            int(start.split('-')[2]))
step_plot_end_date = date(int(end.split('-')[0]),int(end.split('-')[1]),
                          int(end.split('-')[2]))

# Finding the list of all dates between our start and end dates
dates = list(pd.date_range(step_plot_start_date,step_plot_end_date,freq='d'))

# Dictionary to store the stepcount for each date
stepcount = {}

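def datacleanup(dataset):
    # Same flattening as before, but step counts are stored as integers
    # ('intVal'), so this definition replaces the earlier 'fpVal' version

    df = pd.DataFrame()

    for i in range(len(dataset)):
        # Each row keeps its record in column 0; timestamps are in milliseconds
        milliseconds = dataset.iloc[i].get(0)['startTimeMillis']
        date = datetime.datetime.fromtimestamp(milliseconds/1000.0)
        try:
            df = pd.concat([df,pd.DataFrame.from_dict([{
                'date':str(date)[:10],
                'value':dataset.iloc[i].get(0)['point'][0]['value'][0]['intVal']
            }])])
        except Exception:
            # No reading for this entry: keep the date and mark the value missing
            df = pd.concat([df,pd.DataFrame.from_dict([{
                'date':str(date)[:10],
                'value':None
            }])])

    return df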

steps_cleaned = datacleanup(steps_df)
stepcount = {}

# Loop to go over each date in our list
for date_val in dates:
    # Initializing each date in our dictionary as 0
    stepcount[date_val.day_name()[:3]+" ("+
            date_val.to_pydatetime().strftime('%Y-%m-%d')+")"] = 0
    d = str(date_val)[:10]
    res = steps_cleaned[steps_cleaned.date == d]
    if len(res) > 0:
        stepcount[date_val.day_name()[:3]+" ("+
            date_val.to_pydatetime().strftime('%Y-%m-%d')+")"] = res.iloc[0].value
# Counts the average steps in our plot and stores that as a formatted text
average_steps = '{:,}'.format(int(np.mean(list(stepcount.values()))))

# Saving the plot date range in the form of a string
date_range_text = (str(step_plot_start_date.day)+' '+
step_plot_start_date.strftime("%B")[:3]+' - '+
 str(step_plot_end_date.day)+' '+step_plot_end_date.strftime("%B")[:3]+
 ' '+ step_plot_start_date.strftime("%Y"))

# Creating the matplotlib graph
plt1 = plt.figure(figsize=(16,8))
ax = plt1.gca()

# Adding grid lines to the chart
plt.grid(color="#a1a1a1", linestyle='--', linewidth=1, alpha = 0.2)

# Plotting our bars
plt.bar([key[:3] for key in stepcount.keys()],list(stepcount.values()),
        color="#FD4B03")

# Setting labels and titles
plt.ylabel("Step Count",color="#a1a1a1")

# Adding Step header
plt.text(0.15,1,"AVERAGE",fontsize=14,color='#89898B',
         transform=plt1.transFigure,horizontalalignment='center',
         weight='light')
plt.text(0.155,0.957,average_steps,fontsize=24,transform=plt1.transFigure,
         horizontalalignment='center')
plt.text(0.215,0.957,'steps',fontsize=18,transform=plt1.transFigure,
         horizontalalignment='center',color='#89898B')
plt.text(0.183,0.93,date_range_text,fontsize=14,color='#89898B',
        transform=plt1.transFigure, horizontalalignment='center', weight='light')


# Setting x and y ticks
plt.yticks([0,5000,10000,15000,20000,25000])
plt.xticks(color="#a1a1a1")

plt.show()
../_images/notebooks_google_fit_34_0.png

Above is a plot we created ourselves!
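
The date-filling loop in the cell above can also be written as a pandas reindex. The following is only a minimal sketch of that alternative, not the method used in this notebook: `steps_cleaned_demo` and `stepcount_demo` are hypothetical names, and the rows are a hand-typed stand-in for the `date`/`value` columns returned by `datacleanup`.

```python
import pandas as pd

# Hypothetical stand-in for the output of datacleanup(): one row per day
# that has data, with the date stored as a 'YYYY-MM-DD' string.
steps_cleaned_demo = pd.DataFrame({
    "date": ["2022-03-01", "2022-03-02", "2022-03-04"],
    "value": [9504, 6868, 19896],
})

# Every calendar day in the plotting window, including days without data.
week = pd.date_range("2022-03-01", "2022-03-07", freq="D")

# Index by real timestamps, then reindex so that missing days become 0 steps.
daily_steps = (
    steps_cleaned_demo
    .assign(date=pd.to_datetime(steps_cleaned_demo["date"]))
    .set_index("date")["value"]
    .reindex(week, fill_value=0)
)

# Labels in the same "Tue (2022-03-01)" style used for the chart keys.
labels = [f"{d.day_name()[:3]} ({d:%Y-%m-%d})" for d in daily_steps.index]
stepcount_demo = dict(zip(labels, daily_steps.tolist()))
print(stepcount_demo)
```

Reindexing against the full date range keeps the "one bar per day" guarantee in a single step, so days without a matching row still show up in the chart with a height of 0.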

7.2 Visualizing participant’s Weekly Heart Activity!

Similar to 7.1, if you were interested in checking out your heart rate values over the week, then Apple Health would show your Weekly Heart Rate chart. Let’s recreate this chart for the week of your choice using the data that we have fetched from the Google Fit API!

First, we will save the data that we fetched from the Google Fit API in a dictionary keyed by date so that we can easily work with it!

[31]:
# Creating a dictionary to save all the heart rate values with the dates
heartrate_dict = {}

# Traversing through each entry in our DataFrame to save heart
# rate values for each date
for i in range(len(heart_rate_df)):
    # Case when there is no data for a specific date
    try:
        heartrate_dict[datetime.datetime.fromtimestamp(
        int(heart_rate_df.iloc[i].get(0)['startTimeMillis'])// 1000).strftime(
            '%Y-%m-%d')] = (
        int(np.ceil(heart_rate_df.iloc[i].get(0)['point'][0]['value'][0]['fpVal'])),
        int(np.ceil(heart_rate_df.iloc[i].get(0)['point'][0]['value'][1]['fpVal'])),
        int(np.ceil(heart_rate_df.iloc[i].get(0)['point'][0]['value'][2]['fpVal'])))
    except (IndexError, KeyError, TypeError):
        continue
# Creating a dictionary to save all the heart rate values with the dates between the specific end and start dates
heartrate_dict_weekly = {}

# Traversing each date for all the dates between start and end
for date_val in dates:
  # Initializing each date with (0,0,0)
  heartrate_dict_weekly[date_val.day_name()[:3]+" ("+
                  date_val.to_pydatetime().strftime('%Y-%m-%d')+")"] = (0,0,0)
  # Saving the actual heart rate values for dates that have data available
  for key in heartrate_dict.keys():
    if (date_val.to_pydatetime().strftime('%Y-%m-%d') == key):
      heartrate_dict_weekly[date_val.day_name()[:3]+" ("+
      date_val.to_pydatetime().strftime('%Y-%m-%d')+")"] = heartrate_dict[key]

# Finding the lowest and highest heart rate values for our chart header
# (dates without data keep their (0,0,0) placeholder, so they count as 0 here)
bpm_range = str(min([i[2] for i in heartrate_dict_weekly.values()]))+' - '+ str(max([i[1] for i in heartrate_dict_weekly.values()]))

# Initializing the figure
plt2 = plt.figure(figsize=(16,8),facecolor='black')
ax = plt.gca()
ax.set_facecolor('#000000')


# Plotting the values
x = [key[:3] for key in list(heartrate_dict_weekly.keys())]
y = list(heartrate_dict_weekly.values())
plt.plot((range(len(x)),range(len(x))),([i[1] for i in y], [i[2] for i in y]),
         c='#FD4B03',lw=8,solid_capstyle='round')

# Setting y limit to the chart
plt.ylim(0,200)

# Setting x and y ticks
plt.xticks(range(len(x)),x,color="#a1a1a1")
plt.yticks(color="#a1a1a1")

# Setting labels
plt.ylabel('Heart Rate')

# Creating grid lines
plt.grid(color="#a1a1a1", linestyle='--', linewidth=2, alpha = 0.25)


# Adding Heart header
plt.text(0.14,1,"RANGE",fontsize=14,color='#89898B',transform=plt2.transFigure,
         horizontalalignment='center', weight='light')
plt.text(0.1625,0.957,bpm_range,fontsize=24,transform=plt2.transFigure,
         horizontalalignment='center', color = 'white')
plt.text(0.225,0.957,'BPM',fontsize=18,transform=plt2.transFigure,
         horizontalalignment='center',color='#89898B')
plt.text(0.183,0.93,date_range_text,fontsize=14,color='#89898B',
         transform=plt2.transFigure, horizontalalignment='center',
         weight='light')

plt.show()
../_images/notebooks_google_fit_38_0.png

Above is a plot we created ourselves!
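
The nested lookups in the cell above read three `fpVal` entries from each daily bucket. Judging by how the plotting code uses them, index 0 is treated as the average, index 1 as the highest and index 2 as the lowest value. Here is a minimal sketch of that extraction as a standalone function. `summarize_bucket` is a hypothetical helper, the two buckets are made up for illustration, and the timestamp is converted in UTC here, whereas the notebook cell uses the machine's local time zone.

```python
import datetime
import math

def summarize_bucket(bucket):
    """Return (date_string, (avg, high, low)), or None when the bucket is empty.

    Hypothetical helper; assumes the bucket layout used in this notebook.
    """
    points = bucket.get("point", [])
    if not points:
        return None
    values = points[0]["value"]
    day = datetime.datetime.fromtimestamp(
        int(bucket["startTimeMillis"]) // 1000, tz=datetime.timezone.utc
    ).strftime("%Y-%m-%d")
    # Round each value up to a whole number of beats per minute
    avg, high, low = (int(math.ceil(v["fpVal"])) for v in values[:3])
    return day, (avg, high, low)

# Made-up buckets for illustration only
demo_bucket = {
    "startTimeMillis": 1646092800000,
    "point": [{"value": [{"fpVal": 117.9}, {"fpVal": 151.2}, {"fpVal": 62.0}]}],
}
empty_bucket = {"startTimeMillis": 1646179200000, "point": []}

print(summarize_bucket(demo_bucket))
print(summarize_bucket(empty_bucket))
```

Returning `None` for an empty bucket makes the "no data for this date" case explicit instead of relying on an exception.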

8. Data Analysis

Data isn’t much without some analysis, so we’re going to do some in this section.

DISCLAIMER: the analyses below may not be 100% biologically or scientifically grounded; the code is here to assist in your process, if you are interested in asking these kinds of questions.

Maybe the average heart rate is correlated with the number of steps you take in that time interval. Let’s test if this hypothesis is true. We will do so by plotting a jointplot between those two metrics and finding the correlation.

But before we get into that, let’s clean the dataframes to make sure the data that we have is ready for our analysis! We will first start with our step count dataset!

[32]:
# Creates a pandas dataframe with the date values inside the json body
analysis_df = steps_df.assign(date=steps_df.get(0).apply(
    lambda x: x['startTimeMillis']))

# Adds a column for steps to our original df
analysis_df = analysis_df.assign(steps = analysis_df.get(0).apply(
    lambda x: np.nan if len(x['point'])==0 else
     x['point'][0]['value'][0]['intVal']))

analysis_df.head()

[32]:
0 date steps
0 {'startTimeMillis': 1646092800000, 'endTimeMil... 1646092800000 9504
1 {'startTimeMillis': 1646179200000, 'endTimeMil... 1646179200000 6868
2 {'startTimeMillis': 1646265600000, 'endTimeMil... 1646265600000 10247
3 {'startTimeMillis': 1646352000000, 'endTimeMil... 1646352000000 19896
4 {'startTimeMillis': 1646438400000, 'endTimeMil... 1646438400000 14328
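
The `date` column above holds timestamps in milliseconds since the Unix epoch, which are hard to read at a glance. As a small sketch, they can be converted to calendar dates with `pd.to_datetime`. The `demo` frame and its `day` column are illustrative names, and the values are copied from the first rows of the output above.

```python
import pandas as pd

# Values copied from the first rows of the output above
demo = pd.DataFrame({
    "date": [1646092800000, 1646179200000, 1646265600000],
    "steps": [9504, 6868, 10247],
})

# unit="ms" tells pandas the integers count milliseconds since 1970-01-01 UTC
demo["day"] = pd.to_datetime(demo["date"], unit="ms").dt.strftime("%Y-%m-%d")
print(demo)
```

The conversion is done in UTC, so the resulting dates can differ by one day from local-time conversions near midnight.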

Now that we have our step count, we will repeat the process for our heart rate values and drop every pair where either value is null (NaN).

[33]:

# Creates a pandas dataframe with the date values inside the json body
heart_rate = heart_rate_df.assign(date=heart_rate_df.get(0).apply(
    lambda x: x['startTimeMillis']))

# Creates a pandas dataframe with the heart rate values inside the json body
heart_rate = heart_rate.assign(heart_rate=heart_rate_df.get(0).apply(
    lambda x: np.nan if len(x['point'])==0 else
     x['point'][0]['value'][0]['fpVal']))

# Merging our step and heart rate datasets
analysis_df = analysis_df.merge(heart_rate, on='date')

# Dropping useless columns
analysis_df.drop(columns=['0_x','0_y'],inplace=True)

# Dropping all pairs of null values
analysis_df_cleaned = analysis_df.dropna()

# Redisplaying the dataframe for reference
analysis_df_cleaned.head()
[33]:
date steps heart_rate
0 1646092800000 9504 117.9
1 1646179200000 6868 123.1
2 1646265600000 10247 142.4
3 1646352000000 19896 138.8
4 1646611200000 17366 143.1
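
To see why some dates disappear from the merged table, here is a toy sketch of the same merge-then-`dropna` pattern. The frames `steps_demo` and `hr_demo` and all of their values are made up for illustration.

```python
import numpy as np
import pandas as pd

steps_demo = pd.DataFrame({"date": [1, 2, 3, 4],
                           "steps": [9504, 6868, np.nan, 14328]})
hr_demo = pd.DataFrame({"date": [1, 2, 3, 5],
                        "heart_rate": [117.9, np.nan, 142.4, 120.0]})

# merge defaults to an inner join: only dates present in both frames survive
merged = steps_demo.merge(hr_demo, on="date")

# dropna removes any row where steps or heart_rate is missing
cleaned = merged.dropna()
print(cleaned)
```

Dates 4 and 5 are lost in the merge because each appears in only one frame, and dates 2 and 3 are lost in `dropna` because one of the two measurements is missing.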

Now that we have all our required values, let’s create a plot to see whether heart rate and steps are correlated.

[34]:
import seaborn as sns
# Setting Seaborn plot style
sns.set_style("darkgrid")

#Plotting our data
plot = sns.jointplot(x='heart_rate', y='steps', data=analysis_df_cleaned,
                    kind='reg')

../_images/notebooks_google_fit_48_0.png

As we can see from the scatterplot above, it looks like there might be a correlation there. Let’s compute \(R^2\) just to see exactly how correlated.

We’ll follow this documentation and perform a linear regression to obtain the coefficient of determination (\(R^2\)).

[35]:
from scipy import stats

slope, intercept, r_value, p_value, std_err = stats.linregress(
      analysis_df_cleaned.get('heart_rate'), analysis_df_cleaned.get('steps'))

print(f'Slope: {slope:.3g}')
print(f'Coefficient of determination: {r_value**2:.3g}')
print(f'p-value: {p_value:.3g}')
Slope: -43.1
Coefficient of determination: 0.00953
p-value: 0.571

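As the p-value (0.571) is far above the conventional 5% significance level, our result is not statistically significant: there is not enough evidence to conclude that there is a correlation between average heart rate and the total number of steps in a day. The coefficient of determination (0.00953) points the same way, since the fit explains less than 1% of the variance in steps.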
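
For a regression with a single predictor, \(R^2\) is simply the square of the Pearson correlation coefficient, which is why squaring `r_value` works. The sketch below checks this on synthetic data. The arrays are randomly generated and are not Google Fit data, and the 0.05 threshold is the usual convention rather than something prescribed by the library.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two columns, generated independently of each other
rng = np.random.default_rng(0)
heart_rate_demo = rng.normal(120, 20, size=36)
steps_demo = rng.normal(11000, 9000, size=36)

result = stats.linregress(heart_rate_demo, steps_demo)
pearson_r = np.corrcoef(heart_rate_demo, steps_demo)[0, 1]

# Both routes give the same coefficient of determination
r_squared = result.rvalue ** 2
print(f"R^2 from linregress: {r_squared:.4f}")
print(f"R^2 from corrcoef:   {pearson_r ** 2:.4f}")

# Compare the p-value with the chosen significance level
alpha = 0.05
verdict = "significant" if result.pvalue < alpha else "not significant"
print(f"p-value {result.pvalue:.3g} is {verdict} at alpha = {alpha}")
```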

9. Outlier Detection

However, even though our p-value does not provide enough statistical significance to conclude that average heart rate and the number of steps in a day are correlated, there might be outliers that do not follow the general trend. In this section of our analysis, we will check whether such outliers exist and, if they do, visually highlight them in our plot.

Before finding the individual outlier values, it would be interesting to see the summary of our step count and average heart rate parameters. It will give us a clear idea of which values are typical and which can be considered atypical based on the data that we received from Google Fit.

[36]:
analysis_df_cleaned_summary = analysis_df_cleaned.describe().get(
                                          ['steps','heart_rate'])
analysis_df_cleaned_summary
[36]:
steps heart_rate
count 36.000000 36.000000
mean 11410.694444 121.144444
std 9299.564478 21.078268
min 0.000000 77.900000
25% 5136.500000 105.250000
50% 10397.000000 122.350000
75% 15596.750000 139.375000
max 38538.000000 158.500000

To locate the outliers, we will use an unsupervised algorithm called the Elliptic Envelope. It fits an imaginary elliptical region around the dataset: values inside that region are considered normal data, and anything outside it is treated as an outlier. It assumes that the data follows a Gaussian distribution.

“The main idea is to define the shape of the data and anomalies are those observations that lie far outside the shape. First a robust estimate of covariance of data is fitted into an ellipse around the central mode. Then, the Mahalanobis distance that is obtained from this estimate is used to define the threshold for determining outliers or anomalies.” (S. Shriram and E. Sivasankar, 2019, pp. 221-225)
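
To make the idea concrete, here is a minimal sketch on synthetic data rather than on our Google Fit values. It draws points from a two-dimensional Gaussian, adds five hand-placed far-away points, and lets `EllipticEnvelope` flag them. The data, the seed, and the `contamination=0.05` setting are all choices made for this illustration.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# 95 points from a correlated 2-D Gaussian plus 5 hand-placed distant points
rng = np.random.default_rng(42)
inliers = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=95)
far_points = np.array([[8, 8], [-9, 7], [10, -8], [-8, -9], [9, 0]])
X = np.vstack([inliers, far_points])

# contamination is the share of points we expect to be outliers
model = EllipticEnvelope(contamination=0.05, random_state=0)
labels = model.fit_predict(X)      # +1 = inlier, -1 = outlier

# Squared Mahalanobis distance of each point under the robust fit
distances = model.mahalanobis(X)

print("number flagged:", int((labels == -1).sum()))
print("labels of the far points:", labels[-5:].tolist())
print("smallest distance among far points:", float(distances[-5:].min()))
print("largest distance among Gaussian points:", float(distances[:95].max()))
```

Note that `contamination` fixes the fraction of points that get flagged regardless of how unusual they are. With `contamination = 0.25` and 36 rows, about a quarter of the rows will always be labeled as outliers.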

[37]:
from sklearn.covariance import EllipticEnvelope
import copy

# Adding columns to a filtered DataFrame can trigger pandas' SettingWithCopyWarning
# The next line changes a setting that silences that warning

pd.set_option('mode.chained_assignment', None)

#create the model, set the contamination as 0.25
EE_model = EllipticEnvelope(contamination = 0.25)

#implement the model on the data
outliers = EE_model.fit_predict(analysis_df_cleaned.get(
                                                    ["steps", "heart_rate"]))

#extract the labels
analysis_df_cleaned["outlier"] = copy.deepcopy(outliers)

#change the labels
# We use -1 to mark an outlier and +1 for an inlier
analysis_df_cleaned["outlier"] = analysis_df_cleaned["outlier"].apply(
                                    lambda x: str(-1) if x == -1 else str(1))

#extract the score
analysis_df_cleaned["EE_scores"] = EE_model.score_samples(
                            analysis_df_cleaned.get(["steps", "heart_rate"]))

#print the value counts for inlier and outliers
print(analysis_df_cleaned["outlier"].value_counts())
1     27
-1     9
Name: outlier, dtype: int64

Below we will display the analysis_df_cleaned dataframe again to see how the two new columns were added to it!

[38]:
analysis_df_cleaned.head()
[38]:
date steps heart_rate outlier EE_scores
0 1646092800000 9504 117.9 1 -0.064960
1 1646179200000 6868 123.1 1 -0.127376
2 1646265600000 10247 142.4 1 -1.254495
3 1646352000000 19896 138.8 -1 -4.420064
4 1646611200000 17366 143.1 1 -3.626586

Now that we have labeled the outliers as -1, let’s try to see which values of average heart rate and steps are being identified as outliers by our Elliptic Envelope Algorithm.

[39]:
outlier_df = analysis_df_cleaned[analysis_df_cleaned.get('outlier')=='-1'].get(
    ['steps','heart_rate'])
outlier_df
[39]:
steps heart_rate
3 19896 138.8
6 30735 98.8
9 23679 154.4
19 24861 135.0
23 19273 79.1
25 0 97.0
26 38538 123.8
30 13543 77.9
34 0 88.0

Sweet, now that we know that there were outliers in our dataset, let’s try to visually see which pair of values are being identified as outliers using a plot. Highlighting these outliers in a bright red color will make it super easy for us to identify them in our plot.

[40]:
# Setting Figure Size in Seaborn
sns.set(rc={'figure.figsize':(16,8)})

# Setting Seaborn plot style
sns.set_style("darkgrid")

#Plotting our data
plot = sns.regplot(x='heart_rate', y='steps', data=analysis_df_cleaned.drop(
    outlier_df.index))

plt.scatter(outlier_df.get('heart_rate'),outlier_df.get('steps'))
plt.scatter(outlier_df.get('heart_rate'),outlier_df.get('steps'),
            facecolors='red',alpha=.35, s=500)

plt.show()
../_images/notebooks_google_fit_63_0.png

Thus, the points highlighted in red are ones that seem to not be following the general trend of our dataset. Lastly, let’s see what the new p-value is after outlier removal!

[41]:
slope, intercept, r_value, p_value, std_err = stats.linregress(
    analysis_df_cleaned.drop(outlier_df.index).get('heart_rate'),
     analysis_df_cleaned.drop(outlier_df.index).get('steps'))

print(f'Slope: {slope:.3g}')
print(f'Coefficient of determination: {r_value**2:.3g}')
print(f'p-value: {p_value:.3g}')
Slope: -134
Coefficient of determination: 0.137
p-value: 0.0574

Our new p-value after removing the outliers is 0.0574, which is much closer to 5% than before. Therefore, after removing the outliers, our result is closer to being statistically significant, but there is still not enough evidence to conclude that there is a correlation between average heart rate and the total number of steps in a day.