Cronometer: Guide to data extraction and analysis§

A picture of the Cronometer Mobile Application

Do you know that DIET stands for Did I Eat That? Jokes aside, in this notebook we will be doing just that. We will connect to the Cronometer API to analyze the participant's nutritional and exercise habits.

Cronometer is often regarded as the “best nutrition app for Android and iOS.” It lets you track your meals, count calories, identify gaps in your diet, and check your health metrics. It is also compatible with popular fitness devices like Fitbit, Strava, Withings, Garmin, Polar, Qardio and Oura.

Cronometer follows a freemium model, offering a digital service accessible through its mobile applications (iOS and Android). Users also have the option to upgrade and unlock more advanced features like Custom Goals, Training Plans, Race Analysis, etc. for a monthly fee of $8.99.

We’ve been using the Cronometer application for the past few weeks, and we will show you how to extract its data, visualize the participant's data, and compute correlations between multiple metrics. Cronometer does not have a public API, so we reverse engineered the Cronometer API within wearipedia in order to extract servings, daily nutrition, exercises, notes, and biometrics.

We will be able to extract the following parameters:

| Parameter Name | Sampling Frequency |
| --- | --- |
| Body Measurements | Per Entry |
| Blood Pressure | Per Entry / Day |
| Heart Rate | Per Entry / Day |
| Oxygen Saturation | Per Entry / Day |
| VO2 Max | Per Entry / Day |
| Pulse Wave Velocity | Per Entry / Day |
| Labs/Test Results | Per Entry / Day |
| Mood | Per Entry / Day |
| Sleep | Per Entry / Day |
| Food Nutrition Breakdown | Per Serving |
| Exercise Minutes | Per Exercise Entry |
| Exercise Calories Burned | Per Exercise Entry |
| Daily Nutrition Breakdown | Per Day |

In this guide, we sequentially cover the following nine topics to extract data from Cronometer servers:

  1. Set up

  2. Authentication/Authorization

    • Requires only username and password, no OAuth.

  3. Data extraction

    • We get data via wearipedia in a couple lines of code.

  4. Data Exporting

    • We export all of this data to file formats compatible with R, Excel, and MATLAB.

  5. Adherence

    • We simulate non-adherence by dynamically removing datapoints from our simulated data.

  6. Visualization

    • We create a simple plot to visualize our data.

  7. Advanced visualization

    • 7.1 Calorie Consumption Breakdown!

    • 7.2 Plotting the participant's weight!

    • 7.3 Plotting the participant's Workout Minutes!

  8. Data Analysis

    • 8.1 Analyzing correlation between Protein Content and Vitamin B content in food items!

  9. Outlier Detection

    • 9.1 Highlighting Outliers!

Disclaimer: this notebook is purely for educational purposes. All of the data currently stored in this notebook is purely synthetic, meaning randomly generated according to rules we created. Despite this, the end-to-end data extraction pipeline has been tested on our own data, meaning that if you enter your own email and password on your own Colab instance, you can visualize your own real data. That said, we were unable to thoroughly test the timezone functionality because we only have one account, so beware.

1. Setup§

Participant Setup§

Dear Participant,

Once you download the Cronometer app, please set it up by following these resources:

  • Written guide: https://support.cronometer.com/hc/en-us/articles/360021677792-Mobile-Quick-Start-Guide

  • Video guide: https://www.youtube.com/watch?v=XyeXp_wo0to&ab_channel=Cronometer

Make sure that your phone is logged in to the Cronometer app using the Cronometer login credentials (email and password) given to you by the data receiver.

Best,

Wearipedia

Data Receiver Setup§

Please follow the steps below:

  1. Create an email address for the participant, for example foo@email.com.

  2. Create a Cronometer account with the email foo@email.com and some random password.

  3. Keep foo@email.com and password stored somewhere safe.

  4. Request the participant to download the app and instruct them to follow the participant setup letter above.

  5. Install the wearipedia Python package to easily extract data from this app via the Cronometer API.

[1]:
!pip install wearipedia
!pip install openpyxl
!pip uninstall -y seaborn
!pip install seaborn==0.11.1

2. Authentication/Authorization§

To access the data, you need to authenticate. Simply enter the email and password for your Cronometer account. We’ll use these credentials to extract the data in the sections below.

[2]:
#@title Enter Cronometer login credentials
email_address = "saarth@berkeley.edu" #@param {type:"string"}
password = "password" #@param {type:"string"}
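
Typing a password directly into a shared notebook makes it easy to leak. As a minimal sketch, assuming you would rather read the credentials from environment variables, something like the following could work. The variable names `CRONOMETER_EMAIL` and `CRONOMETER_PASSWORD` and the helper `load_credentials` are hypothetical and are not part of wearipedia.

```python
import os


def load_credentials(env=None):
    """Return (email, password) read from a mapping such as os.environ.

    The variable names used here are hypothetical; pick your own.
    """
    env = os.environ if env is None else env
    email = env.get("CRONOMETER_EMAIL", "")
    password = env.get("CRONOMETER_PASSWORD", "")
    if not email or not password:
        raise ValueError("Set CRONOMETER_EMAIL and CRONOMETER_PASSWORD first")
    return email, password


# Example with an explicit mapping instead of the real environment
email_address, password = load_credentials(
    {"CRONOMETER_EMAIL": "foo@email.com", "CRONOMETER_PASSWORD": "not-a-real-password"}
)
```

Calling `load_credentials()` with no argument reads from the real environment and raises a `ValueError` if either value is missing.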

3. Data Extraction§

Data can be extracted via wearipedia, our open-source Python package that unifies dozens of complex wearable device APIs into one simple, common interface.

First, we’ll set a date range and then extract all of the data within that date range. You can select whether you would like synthetic data or not with the checkbox.

[3]:
#@title Enter start and end dates (in the format yyyy-mm-dd)

#set start and end dates - for example, the values below give you all the data from 2022-03-01 (March 1st, 2022) to 2022-09-17 (September 17th, 2022)
start_date='2022-03-01' #@param {type:"string"}
end_date='2022-09-17' #@param {type:"string"}
synthetic = True #@param {type:"boolean"}
[4]:
import wearipedia

device = wearipedia.get_device("cronometer/cronometer")

if not synthetic:
    device.authenticate({"username": email_address, "password": password})

params = {"start_date": start_date, "end_date": end_date}

dailySummary = device.get_data("dailySummary", params=params)
servings = device.get_data("servings", params=params)
exercises = device.get_data("exercises", params=params)
biometrics = device.get_data("biometrics", params=params)
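
Before exporting, it can help to check what came back. The sketch below assumes each data type is returned as a list of dictionaries, which is how the later cells treat it when building DataFrames. The helper `summarize_records` and the sample records are illustrative only.

```python
def summarize_records(records):
    """Return the record count and the sorted union of field names."""
    fields = set()
    for record in records:
        fields.update(record.keys())
    return {"count": len(records), "fields": sorted(fields)}


# Made-up records shaped like the daily summary used later in this guide
sample = [
    {"Date": "2022-03-01", "Energy (kcal)": 2100.0},
    {"Date": "2022-03-02", "Energy (kcal)": 1950.5, "Protein (g)": 80.0},
]
summary = summarize_records(sample)
print(summary)
```

In the notebook you could call `summarize_records(dailySummary)` (or any of the other three variables) to see which fields are available.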

4. Data Exporting§

In this section, we export all of this data to formats compatible with popular scientific computing software (R, Excel, Google Sheets, Matlab). Specifically, we will first export to JSON, which can be read by R and Matlab. Then, we will export to CSV, which can be consumed by Excel, Google Sheets, and every other popular programming language.

Exporting to JSON (R, Matlab, etc.)§

Exporting to JSON is fairly simple. We export each datatype separately and also export a complete version that includes all simultaneously.

[6]:
import json

# Export each data type to its own JSON file, closing each file when done
with open("dailySummary.json", "w") as f:
    json.dump(dailySummary, f)
with open("servings.json", "w") as f:
    json.dump(servings, f)
with open("exercises.json", "w") as f:
    json.dump(exercises, f)
with open("biometrics.json", "w") as f:
    json.dump(biometrics, f)

# Also export everything together in a single file
complete = {
    "dailySummary": dailySummary,
    "servings": servings,
    "exercises": exercises,
    "biometrics": biometrics
}

with open("complete.json", "w") as f:
    json.dump(complete, f)

Feel free to open the file viewer (see left pane) to look at the outputs!
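
To confirm that an export is lossless, you can read a file back and compare it with the original. This is a minimal sketch using only the standard library. The sample record is made up, with the `Day`, `Metric`, and `Amount` field names taken from how the biometrics data is used later in this guide.

```python
import json
import os
import tempfile


def roundtrip_json(data, path):
    """Write data to path as JSON, read it back, and return the result."""
    with open(path, "w") as f:
        json.dump(data, f)
    with open(path) as f:
        return json.load(f)


# Made-up biometrics record for illustration
sample = [{"Day": "2022-03-01", "Metric": "Weight", "Amount": 70.5}]
with tempfile.TemporaryDirectory() as tmp:
    restored = roundtrip_json(sample, os.path.join(tmp, "biometrics.json"))

print(restored == sample)
```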

Exporting to CSV and XLSX (Excel, Google Sheets, R, Matlab, etc.)§

Exporting to CSV/XLSX requires a bit more processing, since they enforce a pretty restrictive schema.

We will thus export the daily summary, servings, exercises, and biometrics all as separate files.

[7]:
import pandas as pd

dailySummary_df = pd.DataFrame.from_dict(dailySummary)

dailySummary_df.to_csv('dailySummary.csv')
dailySummary_df.to_excel('dailySummary.xlsx')

servings_df = pd.DataFrame.from_dict(servings)

servings_df.to_csv('servings.csv', index=False)
servings_df.to_excel('servings.xlsx', index=False)

exercises_df = pd.DataFrame.from_dict(exercises)

exercises_df.to_csv('exercises.csv', index=False)
exercises_df.to_excel('exercises.xlsx', index=False)

biometrics_df = pd.DataFrame.from_dict(biometrics)

biometrics_df.to_csv('biometrics.csv', index=False)
biometrics_df.to_excel('biometrics.xlsx', index=False)

Again, feel free to look at the output files and download them.
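
The "restrictive schema" mentioned above means every row must have the same columns. As a hedged illustration of that idea without pandas, the sketch below writes a list of dictionaries to CSV using the union of all keys as the header and leaves missing values empty. The helper `records_to_csv` and the `Minutes` field name are assumptions made for this example.

```python
import csv
import io


def records_to_csv(records):
    """Serialize a list of dicts to CSV text, one column per distinct key."""
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up exercise records; the second one has no "Minutes" value
sample = [
    {"Day": "2022-03-01", "Exercise": "Running", "Minutes": 30},
    {"Day": "2022-03-02", "Exercise": "Cycling"},
]
csv_text = records_to_csv(sample)
print(csv_text)
```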

5. Adherence§

The device simulator already randomly deletes small chunks of each day. In this section, we will simulate non-adherence over longer periods of time from the participant (day-level and week-level).

Then, we will detect this non-adherence and give a Pandas DataFrame that concisely describes when the participant has had their device on and off throughout the entirety of the time period, allowing you to calculate how long they’ve had it on/off etc.

We will first delete a certain % of blocks either at the day level or week level, with user input.

[8]:
#@title Non-adherence simulation
block_level = "day" #@param ["day", "week"]
adherence_percent = 0.89 #@param {type:"slider", min:0, max:1, step:0.01}
[9]:
import numpy as np

if block_level == "day":
    block_length = 1
elif block_level == "week":
    block_length = 7



# This function will randomly remove datapoints from the
# data we have received from Cronometer based on the
# adherence_percent

def AdherenceSimulator(data):

  num_blocks = len(data) // block_length
  num_blocks_to_keep = int(adherence_percent * num_blocks)
  idxes = np.random.choice(np.arange(num_blocks), replace=False,
  size=num_blocks_to_keep)

  adhered_data = []

  # Keep every datapoint of each selected block, in chronological order
  for i in sorted(idxes):
      start = i * block_length
      end = (i + 1) * block_length
      adhered_data.extend(data[start:end])

  return adhered_data


# Adding adherence for daily summary

dailySummary = AdherenceSimulator(dailySummary)

# Adding adherence for exercises

exercises = AdherenceSimulator(exercises)

# Adding adherence for biometrics

biometrics = AdherenceSimulator(biometrics)

# Adding adherence for servings

servings = AdherenceSimulator(servings)

And now we have fewer datapoints! This gives us a more realistic situation, where participants may stop logging for days or weeks at a time.
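
To make the arithmetic concrete, here is a small sketch of how many blocks the simulation is intended to keep. It assumes one record per day over the 201 days from 2022-03-01 to 2022-09-17. The helper `blocks_kept` is illustrative and is not part of the notebook.

```python
def blocks_kept(num_points, block_length, adherence_percent):
    """Return (blocks kept, datapoints kept) when whole blocks are removed."""
    num_blocks = num_points // block_length
    kept = int(adherence_percent * num_blocks)
    return kept, kept * block_length


# 201 daily records at 89% adherence
day_level = blocks_kept(201, 1, 0.89)    # blocks of one day
week_level = blocks_kept(201, 7, 0.89)   # blocks of seven days
print(day_level, week_level)
```

At the week level only 28 full blocks fit into 201 days, so the 5 leftover days do not belong to any block and are not selected.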

Now let’s detect non-adherence. We will return a Pandas DataFrame sampled at every day.

[10]:
dailySummary_df = pd.DataFrame.from_dict(dailySummary)
servings_df = pd.DataFrame.from_dict(servings)
exercises_df = pd.DataFrame.from_dict(exercises)
biometrics_df = pd.DataFrame.from_dict(biometrics)

We can plot this out, and we get adherence at one-day frequency throughout the entirety of the data collection period. For this chart we will plot Energy consumed over the time period from the dailySummary dataframe.

[11]:
import matplotlib.pyplot as plt
import datetime

dates = pd.date_range(start_date,end_date)

energy = []

for d in dates:
  res = dailySummary_df[dailySummary_df.Date == datetime.datetime.strftime(d,
  '%Y-%m-%d')]['Energy (kcal)']
  if len(res) == 0:
    energy.append(None)
  else:
    energy.append(res.iloc[0])

plt.figure(figsize=(12, 6))
plt.plot(dates, energy)
plt.show()
../_images/notebooks_cronometer_25_0.png
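
The gaps in the plot correspond to days without data. To get a concise table of when the participant was and was not logging, one possible approach is sketched below with the standard library only. The helpers `adherence_by_day` and `adherence_runs` are hypothetical names, and the list of recorded days is made up.

```python
from datetime import date, timedelta


def adherence_by_day(recorded_days, start, end):
    """Return [(day, has_data)] for each calendar day from start to end inclusive."""
    recorded = set(recorded_days)
    days = []
    current = start
    while current <= end:
        day = current.isoformat()
        days.append((day, day in recorded))
        current += timedelta(days=1)
    return days


def adherence_runs(days):
    """Collapse per-day flags into runs of (first_day, last_day, has_data)."""
    runs = []
    for day, has_data in days:
        if runs and runs[-1][2] == has_data:
            # Extend the current run to include this day
            runs[-1] = (runs[-1][0], day, has_data)
        else:
            runs.append((day, day, has_data))
    return runs


# Made-up example: data on March 1, 2 and 5, nothing on March 3 and 4
recorded = ["2022-03-01", "2022-03-02", "2022-03-05"]
days = adherence_by_day(recorded, date(2022, 3, 1), date(2022, 3, 5))
runs = adherence_runs(days)
print(runs)
```

In the notebook, the recorded days could come from the `Date` column of `dailySummary_df`, and the resulting runs could be passed to `pd.DataFrame` with columns such as start, end, and adhered.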

6. Visualization§

We’ve extracted lots of data, but what does it look like?

In this section, we will be visualizing two kinds of biometric data (heart rate and weight) in a simple, customizable plot! This plot is intended to provide a starter example for plotting, whereas later examples emphasize deep control and aesthetics.

[12]:
#@title Basic Plot
feature = "Heart Rate (Apple Health)" #@param ['Heart Rate (Apple Health)', 'Weight']
start_date = "2022-03-04" #@param {type:"date"}
time_interval = "full time" #@param ["one week", "full time"]
smoothness = 0.02 #@param {type:"slider", min:0, max:1, step:0.01}
smooth_plot = True #@param {type:"boolean"}

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

start_date = datetime.strptime(start_date, '%Y-%m-%d')

if time_interval == "one week":
    day_idxes = [i for i,d in enumerate(dates) if d >= start_date and d <= start_date + timedelta(days=7)]
    end_date = start_date + timedelta(days=7)
elif time_interval == "full time":
    day_idxes = [i for i,d in enumerate(dates) if d >= start_date]
    end_date = dates[-1]

if feature == "Weight":
    weights = biometrics_df[biometrics_df['Metric']=='Weight']
    concat_weight = []
    for i,d in enumerate(dates):
        day = d.strftime('%Y-%m-%d')
        if i in day_idxes:
          weight = weights[weights['Day']==day]
          if len(weight) != 0:
              concat_weight += [(day,weight.iloc[0].Amount)]
          else:
              concat_weight += [(day,None)]
    ts = [x[0] for x in concat_weight]

    day_arr = [x[1] for x in concat_weight]

    sigma = 200 * smoothness

    title_fillin = "Weight"


if feature == 'Heart Rate (Apple Health)':
    hrs = biometrics_df[biometrics_df['Metric']=='Heart Rate (Apple Health)']
    concat_hr = []
    for i,d in enumerate(dates):
        day = d.strftime('%Y-%m-%d')
        if i in day_idxes:
          hr = hrs[hrs['Day']==day]
          if len(hr) != 0:
              concat_hr += [(day,hr.iloc[0].Amount)]
          else:
              concat_hr += [(day,None)]
    ts = [x[0] for x in concat_hr]

    day_arr = [x[1] for x in concat_hr]

    sigma = 200 * smoothness

    title_fillin = "Heart Rate"


with plt.style.context('ggplot'):
    fig, ax = plt.subplots(figsize=(15, 8))

    if smooth_plot:
        def to_numpy(day_arr):
            arr_nonone = [x for x in day_arr if x is not None]
            mean_val = int(np.mean(arr_nonone))
            for i,x in enumerate(day_arr):
                if x is None:
                    day_arr[i] = mean_val

            return np.array(day_arr)

        none_idxes = [i for i,x in enumerate(day_arr) if x is None]
        day_arr = to_numpy(day_arr)
        from scipy.ndimage import gaussian_filter
        day_arr = list(gaussian_filter(day_arr, sigma=sigma))
        for i, x in enumerate(day_arr):
            if i in none_idxes:
                day_arr[i] = None

    plt.plot(ts, day_arr)
    start_date_str = start_date.strftime('%Y-%m-%d')
    end_date_str = end_date.strftime('%Y-%m-%d')
    plt.title(f"{title_fillin} from {start_date_str} to {end_date_str}",
              fontsize=20)
    plt.xlabel("Date")
    plt.xticks(ts[::max(1, len(ts) // 8)])
    plt.ylabel(title_fillin)
../_images/notebooks_cronometer_27_0.png

This plot allows you to quickly scan your data at many different time scales (week and full) and for different kinds of measurements (heart rate and weight), which enables easy and fast data exploration.

Furthermore, the smoothness parameter makes it easy to look for patterns in long-term trends.
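
The smoothing step fills missing days with the mean, applies a Gaussian filter, and then restores the gaps. The sketch below illustrates that idea in plain Python. It is not the scipy `gaussian_filter` used in the cell above (edge handling and the integer mean fill differ), and the helper name `smooth_with_gaps` is made up for this example.

```python
import math


def smooth_with_gaps(values, sigma):
    """Gaussian-smooth a list that may contain None, keeping the gaps as None."""
    present = [v for v in values if v is not None]
    mean_val = sum(present) / len(present)
    # Temporarily fill the gaps so the filter has a value at every position
    filled = [mean_val if v is None else v for v in values]

    radius = max(1, int(4 * sigma))
    smoothed = []
    for i in range(len(filled)):
        total = 0.0
        weight_sum = 0.0
        for j in range(max(0, i - radius), min(len(filled), i + radius + 1)):
            w = math.exp(-((j - i) ** 2) / (2 * sigma ** 2))
            total += w * filled[j]
            weight_sum += w
        # Dividing by the weights actually used renormalizes the kernel at the edges
        smoothed.append(total / weight_sum)

    # Put the gaps back so missing days are not drawn as real measurements
    return [None if v is None else s for v, s in zip(values, smoothed)]


flat = smooth_with_gaps([70.0, None, 70.0, 70.0], sigma=1.0)
spike = smooth_with_gaps([0.0, 0.0, 10.0, 0.0, 0.0], sigma=1.0)
print(flat, spike)
```

A larger `sigma` spreads each measurement over more neighboring days, which is why higher smoothness values flatten short-term fluctuations.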

7. Advanced Visualization§

Now we’ll do some more advanced plotting that at times features hardcore matplotlib hacking with the benefit of aesthetic quality.

7.1 Calorie Consumption Breakdown§

The Cronometer app is particularly known for its meal-tracking features. If a user goes to their Trends tab, they will see a day-wise calorie breakdown like the one shown below.

[Image: Cronometer app screenshot of the day-wise calorie breakdown]

Above is a plot from the mobile app itself!

This is a detailed breakdown which shows a user how many kilocalories they are consuming in Proteins, Carbohydrates, Fats and Alcohols.

The Cronometer API also gives us access to the participant's daily nutrition data, which enables us to re-create this chart in Python.
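
The daily summary reports macronutrients in grams, while the chart shows kilocalories. A minimal sketch of the conversion, using the standard Atwater factors (4 kcal/g for carbohydrates and protein, 9 kcal/g for fat, 7 kcal/g for alcohol), is shown here. The helper `macro_kcal` and the sample amounts are illustrative.

```python
# Energy per gram of each macronutrient (kcal/g)
KCAL_PER_GRAM = {"carbs": 4, "protein": 4, "fat": 9, "alcohol": 7}


def macro_kcal(carbs_g=0.0, protein_g=0.0, fat_g=0.0, alcohol_g=0.0):
    """Return the kcal contributed by each macronutrient and their total."""
    kcal = {
        "carbs": carbs_g * KCAL_PER_GRAM["carbs"],
        "protein": protein_g * KCAL_PER_GRAM["protein"],
        "fat": fat_g * KCAL_PER_GRAM["fat"],
        "alcohol": alcohol_g * KCAL_PER_GRAM["alcohol"],
    }
    kcal["total"] = sum(kcal.values())
    return kcal


# Made-up day: 50 g carbs, 20 g protein, 10 g fat, no alcohol
breakdown = macro_kcal(carbs_g=50, protein_g=20, fat_g=10, alcohol_g=0)
print(breakdown)
```

The total computed this way may differ slightly from the `Energy (kcal)` value reported by the app, since the factors are rounded averages.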

[15]:
#@title Enter the Start Date

year_string = '2022' #@param {type:"string"}
month_string = '05' #@param {type:"string"}
day_string = '15' #@param{type:"string"}

start_date = year_string+'-'+month_string+'-'+day_string

[16]:

#@title Enter the End Date and plotting

year_string = '2022' #@param {type:"string"}
month_string = '05' #@param {type:"string"}
day_string = '28' #@param{type:"string"}

end_date = year_string+'-'+month_string+'-'+day_string

from matplotlib import lines as mlines

nutrition = dailySummary_df

test = nutrition[nutrition.get('Date')>=start_date].get(['Date','Carbs (g)',
    'Fat (g)','Protein (g)','Alcohol (g)','Energy (kcal)'])
test = test[test.get('Date')<=end_date]

# Fixing the Date Values for the xticks in our chart
# (date_fixer is the helper defined in the weight-plot cell of section 7.2)
test = test.assign(Date_formatted = test.get('Date').apply(
    lambda x: date_fixer(x).split()[1]+' '+date_fixer(x).split()[0]))

# 1g Carbs is equal to 4 kcal
test = test.assign(Carbs = test.get('Carbs (g)')*4)

# 1g Proteins is equal to 4 kcal
test = test.assign(Protein = test.get('Protein (g)')*4)

# 1g Fats is equal to 9 kcal
test = test.assign(Fat = test.get('Fat (g)')*9)

# 1g Alcohol is equal to 7 kcal
test = test.assign(Alcohol = test.get('Alcohol (g)')*7)

# Creating a matplotlib plot of size 7,10
fig3 = plt.figure(figsize=(7,10),facecolor='#1C1D21')
ax = fig3.gca()
ax.set_facecolor('#1C1D21')
ax.set_axisbelow(True)

# Adding header to the chart (first date and its year, then last date and its year)
header_text = (test.iloc[0].get('Date_formatted')+', '+
               test.iloc[0].get('Date').split('-')[0]+' - '+
               test.iloc[-1].get('Date_formatted')+', '+
               test.iloc[-1].get('Date').split('-')[0])

plt.title('Energy Consumed (kcal)',y=1.06, color='white',fontsize=24)
plt.suptitle(header_text,y=0.92, color='white',fontsize=16)

# Setting grid
plt.grid(axis='y', linewidth=2)

# Plotting the values for Fats
plt.bar(test.get('Date'),test.get('Fat'), color='#DB4B44', width=0.6)

# Plotting the values for Carbohydrates (stacked on top of Fat)
plt.bar(test.dropna().get('Date'),test.dropna().get('Carbs'),color='#529BE3',
        width=0.6, bottom = list(test.dropna().get('Fat')))

# Plotting the values for Protein (stacked on top of Fat and Carbs)
plt.bar(test.dropna().get('Date'),test.dropna().get('Protein'),color='#5C9851',
        width=0.6, bottom = [l+w for l,w in zip(test.dropna().get('Fat'),
                                                test.dropna().get('Carbs'))])

# Plotting the values for Alcohol (stacked on top of Fat, Carbs, and Protein)
plt.bar(test.dropna().get('Date'),test.dropna().get('Alcohol'),color='#F2C04F',
        width=0.6, bottom = [l+w+b for l,w,b in zip(test.dropna().get('Fat'),
                                                    test.dropna().get('Carbs'),
                                                    test.dropna().get('Protein'))])

# Setting x and y ticks
plt.xticks(test.get('Date')[::3],list(test.get('Date_formatted'))[::3],
           color='white', fontsize= 12)
plt.yticks([0,250,500,750,1000,1250,1500,1750,2000,2250], fontsize= 12,
           color='white')

# Plotting the legend
rect1 = mlines.Line2D([], [], marker=None, markersize=30, linewidth=5, color="#DB4B44")
rect2 = mlines.Line2D([], [], marker=None, markersize=30, linewidth=5, color="#529BE3")
rect3 = mlines.Line2D([], [], marker=None, markersize=30, linewidth=5, color="#5C9851")
rect4 = mlines.Line2D([], [], marker='None', markersize=30, linewidth=5, color="#F2C04F")
leg = ax.legend((rect1, rect2,rect3,rect4), ("Fat (kcal)", "Carbohydrates (kcal)",
                "Protein (kcal)", "Alcohol (kcal)"),
                bbox_to_anchor=(0.5,-0.2), loc="center", frameon=False, ncol=2,
                markerscale=1.5, fontsize=16 , labelspacing=1)

# Setting color for all the labels to white
for text in leg.get_texts():
    text.set_color("white")

# Showing the chart
plt.show()
../_images/notebooks_cronometer_36_0.png

^ Above is a plot we created ourselves!

7.2 Plotting the participant's weight!§

If the participant has a weighing device that connects with Cronometer/Apple Health automatically or the user has been manually entering their weight in the Cronometer app, they can go to the Trends tab on the Cronometer app to see a graph for their weight over time. Below is a screenshot from the Cronometer app that shows a user’s weight.

[Image: Cronometer app screenshot of the user's weight over time]

Above is a plot from the mobile app itself!

[13]:
#@title Set date range for the chart above

start = "2022-05-15" #@param {type:"date"}
end = "2022-05-22" #@param {type:"date"}

# Creating a datetime object for the start date
start_date = datetime.strptime(start,'%Y-%m-%d')
start_date_string =(str(start_date.strftime("%B"))+' '+
                    str(start_date.day)+', '+str(start_date.year))

# Creating a datetime object for our end date
end_date = datetime.strptime(end,'%Y-%m-%d')
end_date_string =(str(end_date.strftime("%B"))+' '+
                    str(end_date.day)+', '+str(end_date.year))

# Finding a list of all dates between start and end date
dates = list(pd.date_range(start_date,end_date,freq='d'))
dates = [datetime.strftime(d,'%Y-%m-%d') for d in dates]

def date_fixer(date):
  # Creating a date time object for the date
  date = datetime.strptime(date,'%Y-%m-%d')
  # Returning the date in the required format
  return str(date.strftime("%B"))[:3]+' '+str(date.day)

# Getting only the weights values
weights = biometrics_df[biometrics_df.get('Metric')=='Weight']

datefixer = lambda x: datetime.strptime(x,'%Y-%m-%d')

weights = weights.assign(Day=weights.get('Day').apply(datefixer))

# Assigning the date_fixer function for getting the required xticks
xticks_fixed = [date_fixer(d) for d in dates]

weights = weights[weights.get('Day')>=start_date]
weights = weights[weights.get('Day')<=end_date]

weights =  weights.assign(Day=weights.get('Day').apply(lambda x: str(x)[:10]))

# Initializing the figure
fig1 = plt.figure(figsize=(14,8),facecolor='#20242A')
ax = fig1.gca()
ax.set_facecolor('#20242A')

# Plotting the Data
plt.plot(weights.get('Day'),weights.get('Amount'),color='#AE9EBD')

# Adding grids
plt.grid(axis="y",lw=2,color='white')
plt.grid(axis='x', alpha=0)

# Setting x and y ticks
plt.yticks(range(int(np.floor(np.min(list(weights.get('Amount'))))),
         int(np.ceil(np.max(list(weights.get('Amount')))+2))),color='white',
         fontsize = 20)
plt.xticks(dates[::2],xticks_fixed[::2],color='white', fontsize = 20,
            fontname='Liberation Sans')

# Setting y limits for the graph
plt.ylim(np.floor(np.min(list(weights.get('Amount')))),
         np.ceil(np.max(list(weights.get('Amount')))+1))


# Removing the spines on top, left and right
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)

# Adding the weight footer
plt.text(0.175,0,'Weight',transform=fig1.transFigure, color='white',
         horizontalalignment='center',fontsize=30,
         family='sans-serif')
plt.text(0.265,-0.05, start_date_string+" - "+end_date_string,
         transform=fig1.transFigure, color='white',horizontalalignment='center'
         ,fontsize=20, fontweight=550, family='sans-serif')
# Displaying the graph
plt.show()
../_images/notebooks_cronometer_40_0.png

^ Above is a plot we created ourselves!

7.3 Plotting the Participant's Workout Minutes§

If the participant wears an Apple Watch or uses another compatible device, they can open the Fitness app to see their average workout minutes in a chart like the one below.

b777a21cb6b3460c8c99ec1a1427b18f

Above is a plot from the mobile app itself!

Because Cronometer automatically syncs with Apple Health, we have access to all of this workout data. In this part of the notebook we will recreate the plot above.

[14]:
#@title Set date range for the chart above

start_date = "2022-04-28" #@param {type:"date"}
end_date = "2022-05-27" #@param {type:"date"}

# The next line creates the date header for the chart
date_range_text= (str(datetime.strptime(start_date,'%Y-%m-%d').day)+
                  ' '+datetime.strptime(start_date,'%Y-%m-%d').strftime("%B")+
                  ' - '+str(datetime.strptime(end_date,'%Y-%m-%d').day)+' '+
                  datetime.strptime(end_date,'%Y-%m-%d').strftime("%B"))

def date_fixer(date):
  # Creating a date time object for the date
  date = datetime.strptime(date,'%Y-%m-%d')
  # Returning the date in the required format
  return str(date.strftime("%B"))[:3]+' '+str(date.day)

exercises_df = pd.DataFrame.from_dict(exercises)
exercises_grouped = exercises_df[exercises_df.get('Exercise')!="Active Energy Balance (Apple Health)"].groupby('Day').sum()

start_date = datetime.strptime(start_date,'%Y-%m-%d')
end_date = datetime.strptime(end_date,'%Y-%m-%d')

dates = list(pd.date_range(start_date,end_date,freq='d'))
dates = [datetime.strftime(d,'%Y-%m-%d') for d in dates]
exercises_grouped = exercises_grouped.reset_index()
exercises_grouped = exercises_grouped.assign(Day_cleaned=exercises_grouped.get('Day').apply(datefixer))

# Assigning the date_fixer function for getting the required xticks
xticks_fixed = [date_fixer(d) for d in dates]


exercises_grouped = exercises_grouped[exercises_grouped.get('Day_cleaned')>=start_date]
exercises_grouped = exercises_grouped[exercises_grouped.get('Day_cleaned')<=end_date]

exercises_grouped = exercises_grouped.set_index('Day')

!wget https://www.fontsquirrel.com/fonts/download/open-sans
!unzip open-sans
!mv OpenSans-Regular.ttf /usr/share/fonts/truetype/
!mv OpenSans-Light.ttf /usr/share/fonts/truetype/
!mv OpenSans-Semibold.ttf /usr/share/fonts/truetype/
!mv OpenSans-Bold.ttf /usr/share/fonts/truetype/
!mv OpenSans-ExtraBold.ttf /usr/share/fonts/truetype/

from matplotlib import font_manager as fm

# Initializing the figure
plt2 = plt.figure(figsize=(14,8),facecolor='#000000')
ax = plt2.gca()
ax.set_facecolor('#000000')
ax.yaxis.tick_right()
ax.set_axisbelow(True)

# Chart mean
chart_mean = int(exercises_grouped.get('Minutes').mean())

# Creating grid lines
plt.grid(axis = 'x',color="#a1a1a1", linestyle='--', linewidth=2, alpha = 0.2)
plt.grid(axis = 'y',color="#a1a1a1", linestyle='-', linewidth=2, alpha = 0.2)

# Plotting the chart
plt.bar(exercises_grouped.index,exercises_grouped.get('Minutes'),width=0.4,
        color='#2C2C2E',joinstyle='round')

plt.axhline(y=int(exercises_grouped.get('Minutes').mean()), xmin=0.1,
            linewidth=7,xmax= 0.9, color='#FF5810',solid_capstyle='round')

# Loading the Open Sans fonts used for the chart text

openSansBold = '/usr/share/fonts/truetype/OpenSans-Bold.ttf'
opensans_bold = fm.FontProperties(fname=openSansBold, size= 18, weight='bold')
opensans_bold_heading = fm.FontProperties(fname=openSansBold, size= 36, weight='bold')


openSansExtraBold = '/usr/share/fonts/truetype/OpenSans-ExtraBold.ttf'
opensans_ebold = fm.FontProperties(fname=openSansExtraBold, size= 16, weight=500)

# find_tens returns every multiple of 50 between start and end (inclusive)
find_tens = lambda start, end: [i for i in range(int(start), int(end)+1) if i % 50 == 0]
# Setting x and y ticks
yticks = find_tens(0,max(exercises_grouped.get('Minutes')+50))
plt.yticks(yticks, color='white',fontsize=16, fontweight=600,
           alpha=0.3, fontproperties=opensans_bold)
plt.xticks(exercises_grouped.index[2::4],
           [d.split('-')[2].lstrip('0') for d in list(exercises_grouped.index)][2::4],
           color='white',fontsize=16, fontweight=600,
           alpha=0.3, fontproperties=opensans_bold)
plt.text(0.79,chart_mean/yticks[-1]+0.07,str(chart_mean)+' min',
         fontsize=20,transform=plt2.transFigure,
         horizontalalignment='center', color = '#FF5810', fontproperties=opensans_bold)


plt.text(0.173,1,"AVERAGE",fontsize=16,color='#89898B',transform=plt2.transFigure,
         horizontalalignment='center', fontproperties=opensans_bold)
plt.text(0.15,0.94,str(chart_mean),
         fontsize=28,transform=plt2.transFigure,
         horizontalalignment='center', color = 'white',
         fontproperties=opensans_bold_heading)
plt.text(0.19,0.94,'min',fontsize=18,transform=plt2.transFigure,
         horizontalalignment='center',color='#89898B',fontproperties=opensans_bold)
plt.text(0.225,0.9,date_range_text,fontsize=14,color='#89898B',
         transform=plt2.transFigure, horizontalalignment='center',
         fontproperties=opensans_bold)

# Displaying the chart
plt.show()

--2023-02-16 01:21:06--  https://www.fontsquirrel.com/fonts/download/open-sans
Resolving www.fontsquirrel.com (www.fontsquirrel.com)... 45.79.150.110, 2600:3c03::f03c:91ff:fe37:ba29
Connecting to www.fontsquirrel.com (www.fontsquirrel.com)|45.79.150.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1215743 (1.2M) [application/octet-stream]
Saving to: ‘open-sans’

open-sans           100%[===================>]   1.16M  2.95MB/s    in 0.4s

2023-02-16 01:21:08 (2.95 MB/s) - ‘open-sans’ saved [1215743/1215743]

Archive:  open-sans
  inflating: OpenSans-Light.ttf
  inflating: OpenSans-LightItalic.ttf
  inflating: OpenSans-Regular.ttf
  inflating: OpenSans-Italic.ttf
  inflating: OpenSans-Semibold.ttf
  inflating: OpenSans-SemiboldItalic.ttf
  inflating: OpenSans-Bold.ttf
  inflating: OpenSans-BoldItalic.ttf
  inflating: OpenSans-ExtraBold.ttf
  inflating: OpenSans-ExtraBoldItalic.ttf
  inflating: Apache License.txt
../_images/notebooks_cronometer_47_1.png

^ Above is a plot we created ourselves!
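
The data preparation behind the chart above can be sketched in isolation. This is a minimal illustration on a made-up exercise log, assuming the same `Day`, `Exercise`, and `Minutes` column names as `exercises_df`; the values are hypothetical and not taken from the participant's data.

```python
import pandas as pd

# Hypothetical exercise log; column names mirror the notebook's exercises_df
toy_exercises = pd.DataFrame({
    "Day": ["2022-05-09", "2022-05-09", "2022-05-10", "2022-05-20", "2022-06-01"],
    "Exercise": ["Running", "Active Energy Balance (Apple Health)",
                 "Cycling", "Walking", "Running"],
    "Minutes": [30.0, 0.0, 45.0, 20.0, 60.0],
})

# Drop the Apple Health balance rows, then total the minutes per day
workouts = toy_exercises[
    toy_exercises["Exercise"] != "Active Energy Balance (Apple Health)"]
daily_minutes = workouts.groupby("Day")["Minutes"].sum()

# Keep only the days inside the chart's date range
days = pd.to_datetime(daily_minutes.index)
in_range = (days >= pd.Timestamp("2022-04-28")) & (days <= pd.Timestamp("2022-05-27"))
daily_minutes = daily_minutes[in_range]

# Day-of-month tick labels: lstrip removes only the leading zero ("09" -> "9"),
# whereas strip would also turn "10" into "1"
tick_labels = [d.split("-")[2].lstrip("0") for d in daily_minutes.index]

# Value drawn as the horizontal average line
chart_mean = int(daily_minutes.mean())
print(daily_minutes)
print(tick_labels, chart_mean)
```

Selecting the `Minutes` column before calling `sum()` keeps the aggregation restricted to the numeric column that the chart actually uses.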

8. Data Analysis§

While researching how to increase muscle mass, I stumbled upon an article on HealthyEating, a website by the Dairy Council of California. The article opened with the claim that high-protein foods such as meat, poultry, fish, beans and peas, eggs, and nuts and seeds are also an excellent source of B vitamins.

In this portion of the notebook, we will see if there is a general correlation between a food’s protein content and its B vitamin content. Before beginning this experiment, let’s take another look at our servings dataset, which contains all the food items logged in the Cronometer app.

[17]:
servings_df.head()
[17]:
Day Food Name Amount Energy (kcal) Alcohol (g) Caffeine (mg) Water (g) B1 (Thiamine) (mg) B2 (Riboflavin) (mg) B3 (Niacin) (mg) ... Leucine (g) Lysine (g) Methionine (g) Phenylalanine (g) Protein (g) Threonine (g) Tryptophan (g) Tyrosine (g) Valine (g) Category
0 2022-03-01 Chipotle, Black Beans 1.00 cup, whole pieces 196.16 0.0 0.0 122.04 0.34 0.09 0.79 ... 0.94 0.81 0.18 0.64 11.95 0.50 0.14 0.34 0.62 Fast Foods
1 2022-03-02 Chipotle, Black Beans 1.00 cup, whole pieces 196.16 0.0 0.0 122.04 0.34 0.09 0.79 ... 0.94 0.81 0.18 0.64 11.95 0.50 0.14 0.34 0.62 Fast Foods
2 2022-03-04 Sausage, Pork, Fresh 3.00 medium link - breakfast size 195.00 0.0 0.0 29.93 0.15 0.11 3.67 ... 0.82 0.74 0.26 0.40 11.12 0.37 0.11 0.31 0.53 Sausages and Luncheon Meats
3 2022-03-05 Vegetable Soup, Plain, Homemade 1.00 cup 97.24 0.0 0.0 234.00 0.26 0.10 1.42 ... 0.18 0.17 0.04 0.12 3.12 0.12 0.04 0.08 0.15 Soups, Sauces, and Gravies
4 2022-03-06 Doritos, Tortilla Chips, Nacho Cheese 1.00 oz 140.68 0.0 0.0 2.26 0.08 0.05 0.79 ... 0.24 0.06 0.04 0.10 2.05 0.08 0.02 0.08 0.10 Snacks

5 rows × 55 columns

The current DataFrame has 55 columns, so let’s shorten it to the columns we specifically need for our analysis.

[18]:
servings_cleaned = servings_df.get(['Day','Protein (g)','B1 (Thiamine) (mg)',
       'B2 (Riboflavin) (mg)', 'B3 (Niacin) (mg)',
       'B5 (Pantothenic Acid) (mg)', 'B6 (Pyridoxine) (mg)',
       'B12 (Cobalamin) (µg)']).dropna()

servings_cleaned.head()
[18]:
Day Protein (g) B1 (Thiamine) (mg) B2 (Riboflavin) (mg) B3 (Niacin) (mg) B5 (Pantothenic Acid) (mg) B6 (Pyridoxine) (mg) B12 (Cobalamin) (µg)
0 2022-03-01 11.95 0.34 0.09 0.79 0.36 0.14 0.00
1 2022-03-02 11.95 0.34 0.09 0.79 0.36 0.14 0.00
2 2022-03-04 11.12 0.15 0.11 3.67 0.48 0.12 0.59
3 2022-03-05 3.12 0.26 0.10 1.42 0.62 0.29 0.00
4 2022-03-06 2.05 0.08 0.05 0.79 0.11 0.13 0.00

As there are multiple B vitamins, let’s aggregate them into a single column containing the sum of all of them.

[34]:
servings_cleaned = servings_cleaned.assign(Vitamin_B = servings_cleaned.get([
    'B1 (Thiamine) (mg)','B2 (Riboflavin) (mg)', 'B3 (Niacin) (mg)',
    'B5 (Pantothenic Acid) (mg)', 'B6 (Pyridoxine) (mg)',
    'B12 (Cobalamin) (µg)']).sum(axis=1))

servings_cleaned.head()
[34]:
Day Protein (g) B1 (Thiamine) (mg) B2 (Riboflavin) (mg) B3 (Niacin) (mg) B5 (Pantothenic Acid) (mg) B6 (Pyridoxine) (mg) B12 (Cobalamin) (µg) Vitamin_B
0 2022-03-01 11.95 0.34 0.09 0.79 0.36 0.14 0.00 1.72
1 2022-03-02 11.95 0.34 0.09 0.79 0.36 0.14 0.00 1.72
2 2022-03-04 11.12 0.15 0.11 3.67 0.48 0.12 0.59 5.12
3 2022-03-05 3.12 0.26 0.10 1.42 0.62 0.29 0.00 2.69
4 2022-03-06 2.05 0.08 0.05 0.79 0.11 0.13 0.00 1.16
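
Note that the sum above adds the B12 column, which Cronometer reports in µg, directly to columns reported in mg. One possible variant, which is not what this notebook does, converts B12 to mg before summing. The sketch below uses a small hypothetical frame with only three of the columns, and the `Vitamin_B_mg` name is made up for illustration.

```python
import pandas as pd

# Hypothetical two-row frame using the notebook's column names
toy = pd.DataFrame({
    "B3 (Niacin) (mg)": [0.79, 3.67],
    "B6 (Pyridoxine) (mg)": [0.14, 0.12],
    "B12 (Cobalamin) (µg)": [0.00, 0.59],
})

mg_columns = ["B3 (Niacin) (mg)", "B6 (Pyridoxine) (mg)"]

# 1 mg = 1000 µg, so divide the µg column by 1000 before adding it
b12_mg = toy["B12 (Cobalamin) (µg)"] / 1000
toy = toy.assign(Vitamin_B_mg=toy[mg_columns].sum(axis=1) + b12_mg)
print(toy["Vitamin_B_mg"])
```

With consistent units, B12 contributes very little to the total because typical amounts are a fraction of a microgram to a few micrograms.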

For our plot, let’s see if there is a visual correlation between the protein content of a food item and its Vitamin B content.

[20]:
import seaborn as sns
# Setting Figure Size in Seaborn
sns.set(rc={'figure.figsize':(16,8)})

# Setting Seaborn plot style
sns.set_style("darkgrid")

#Plotting our data
plot = sns.regplot(data=servings_cleaned, x='Protein (g)', y="Vitamin_B")

#Renaming x and y labels
plot.set_xlabel("Protein (g)", fontsize = 16)
plot.set_ylabel("Vitamin Bs (mg)", fontsize = 16)

print()

../_images/notebooks_cronometer_58_1.png

Based on the graph, there appears to be a positive association between the protein content and the Vitamin B content of a food item. To confirm this relationship statistically, we will use linear regression to find the line of best fit and calculate the correlation coefficient and p-value.

To perform the linear regression, we can use the linregress function from the scipy.stats module, which returns the slope and intercept of the fitted line along with the correlation coefficient and the p-value.

[23]:
from scipy import stats

slope, intercept, r_value, p_value, std_err = stats.linregress(
    servings_cleaned.get('Protein (g)'), servings_cleaned.get('Vitamin_B'))

print(f'Slope: {slope:.3g}')
print(f'Coefficient of determination: {r_value**2:.3g}')
print(f'p-value: {p_value:.3g}')
Slope: 0.316
Coefficient of determination: 0.5
p-value: 1.73e-15

The p-value of 1.73e-15 is much smaller than the 5% cutoff. This means there is enough evidence to conclude that there is a correlation between the protein content and the Vitamin B content of a food item.
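
As a quick sanity check on what `r_value` means, the sketch below compares it with the Pearson correlation coefficient computed directly by `scipy.stats.pearsonr`. It runs on synthetic data standing in for the two columns, so the printed numbers are not the notebook's results.

```python
import numpy as np
from scipy import stats

# Synthetic data standing in for the protein and Vitamin B columns
rng = np.random.default_rng(0)
protein = rng.uniform(0, 40, size=100)
vitamin_b = 0.3 * protein + rng.normal(0, 3, size=100)

fit = stats.linregress(protein, vitamin_b)
r, p = stats.pearsonr(protein, vitamin_b)

# For a regression with a single predictor, linregress's rvalue is the
# Pearson correlation coefficient, and its square is the R^2 of the fit
print(f"linregress r = {fit.rvalue:.4f}, pearsonr r = {r:.4f}")
print(f"R^2 = {fit.rvalue**2:.4f}")
```

This is why squaring `r_value` in the cell above gives the coefficient of determination.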

9.0 Outlier Detection§

However, even though our p-value indicates a statistically significant correlation between protein and Vitamin B content, some food items might not follow this trend. In this section, we will check whether such outliers exist and, if they do, highlight them in our plot.

Before finding the individual outliers, let’s look at a summary of our protein and Vitamin B values. The summary reports the mean, standard deviation, minimum, maximum, and quartiles (the 50% row is the median), which gives us an idea of the range and distribution of the data we collected from Cronometer and of which values may be unusual.

[31]:
servings_cleaned_summary = servings_cleaned.describe().get(
                                          ['Protein (g)','Vitamin_B']).drop('count')
servings_cleaned_summary
[31]:
Protein (g) Vitamin_B
mean 14.060319 6.310851
std 14.719603 6.574157
min 0.010000 0.000000
25% 3.120000 1.990000
50% 8.960000 5.005000
75% 20.837500 7.842500
max 77.080000 30.640000

To locate the outliers we will use an unsupervised algorithm called the Elliptic Envelope. It fits an imaginary elliptical region around the dataset: values inside the region are considered normal, and anything outside it is treated as an outlier. It assumes that the data follows a Gaussian distribution.

“The main idea is to define the shape of the data and anomalies are those observations that lie far outside the shape. First a robust estimate of covariance of data is fitted into an ellipse around the central mode. Then, the Mahalanobis distance that is obtained from this estimate is used to define the threshold for determining outliers or anomalies.” (S. Shriram and E. Sivasankar, 2019, pp. 221-225)
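
To make the Mahalanobis distance concrete, here is a minimal NumPy sketch on synthetic data. It is a simplification: it uses the plain sample mean and covariance rather than the robust estimate described in the quote, and the 2% cutoff is an assumption chosen to mirror the contamination value used in the next cell.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D Gaussian cloud plus one injected far-away point (last row)
cloud = rng.multivariate_normal([10, 5], [[9, 4], [4, 4]], size=200)
points = np.vstack([cloud, [40.0, 1.0]])

# Plain (non-robust) estimates of location and covariance
center = points.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(points, rowvar=False))

# Squared Mahalanobis distance of every point: (x - mu)^T Sigma^-1 (x - mu)
diff = points - center
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Flag the 2% of points that lie farthest from the center
threshold = np.quantile(d2, 0.98)
flagged = np.where(d2 > threshold)[0]
print(flagged, d2[-1])
```

Unlike the Euclidean distance, this measure accounts for the spread and correlation of the two variables, so a point can be flagged for lying off the trend even if neither coordinate is extreme on its own.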

[25]:
from sklearn.covariance import EllipticEnvelope
import copy

# Assigning columns to a DataFrame slice can trigger pandas'
# SettingWithCopyWarning; the next line silences that warning

pd.set_option('mode.chained_assignment', None)

# Create the model, assuming about 2% of the rows are outliers
EE_model = EllipticEnvelope(contamination = 0.02)

# Fit the model and label each row
outliers = EE_model.fit_predict(servings_cleaned.get(
                                                  ['Protein (g)','Vitamin_B']))

# Store the labels
servings_cleaned["outlier"] = copy.deepcopy(outliers)

# Convert the labels to strings
# We use -1 to mark an outlier and +1 for an inlier
servings_cleaned["outlier"] = servings_cleaned["outlier"].apply(
                                    lambda x: str(-1) if x == -1 else str(1))

#extract the score
servings_cleaned["EE_scores"] = EE_model.score_samples(
                      servings_cleaned.get(['Protein (g)','Vitamin_B']))

#print the value counts for inlier and outliers
print(servings_cleaned["outlier"].value_counts())
1     93
-1     1
Name: outlier, dtype: int64

Below we display the servings_cleaned dataframe again to see the two new columns that were added to it!

[26]:
servings_cleaned.head()
[26]:
Day Protein (g) B1 (Thiamine) (mg) B2 (Riboflavin) (mg) B3 (Niacin) (mg) B5 (Pantothenic Acid) (mg) B6 (Pyridoxine) (mg) B12 (Cobalamin) (µg) Vitamin_B outlier EE_scores
0 2022-03-01 11.95 0.34 0.09 0.79 0.36 0.14 0.00 1.72 1 -7.972490
1 2022-03-02 11.95 0.34 0.09 0.79 0.36 0.14 0.00 1.72 1 -7.972490
2 2022-03-04 11.12 0.15 0.11 3.67 0.48 0.12 0.59 5.12 1 -0.043979
3 2022-03-05 3.12 0.26 0.10 1.42 0.62 0.29 0.00 2.69 1 -0.796108
4 2022-03-06 2.05 0.08 0.05 0.79 0.11 0.13 0.00 1.16 1 -0.980568
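
The `EE_scores` column above can be read as follows: in scikit-learn, `score_samples` returns the negative squared Mahalanobis distance under the fitted robust covariance, so scores closer to zero are more typical. The sketch below checks this relationship on synthetic data; the `random_state=0` argument is added here only to make the sketch reproducible and is not used in the notebook.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
# Synthetic stand-in for the two feature columns
X = rng.multivariate_normal([10, 5], [[9, 4], [4, 4]], size=200)

model = EllipticEnvelope(contamination=0.02, random_state=0).fit(X)

scores = model.score_samples(X)   # negative squared Mahalanobis distance
labels = model.predict(X)         # +1 for an inlier, -1 for an outlier

# The score is the negated squared Mahalanobis distance under the robust fit
same = np.allclose(scores, -model.mahalanobis(X))
# A point is labeled -1 exactly when its score falls below model.offset_
consistent = np.array_equal(labels == -1, scores < model.offset_)
print(same, consistent, int((labels == -1).sum()))
```

The `offset_` threshold is derived from the `contamination` value, which is why roughly that fraction of the training rows ends up labeled as outliers.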

Now that we have labeled the outliers as -1, let’s see which values of protein and Vitamin B content are being identified as outliers by our Elliptic Envelope algorithm.

[27]:
outlier_df = servings_cleaned[servings_cleaned.get('outlier')=='-1'].get(
    ['Protein (g)','Vitamin_B'])
outlier_df_cleaned = outlier_df.drop_duplicates()
outlier_df_cleaned
[27]:
Protein (g) Vitamin_B
70 77.08 2.45

By matching the outlier values with their respective food items, we can identify which specific foods are contributing to the outliers in our protein vs. Vitamin B scatterplot. This information can help us understand which foods may be driving the unusual values and potentially provide insights into why these outliers exist.

[39]:
servings_df.loc[outlier_df_cleaned.index]['Food Name'].to_list()
[39]:
['Whey Protein Powder, 24 Grams of Protein per Scoop']

Our algorithm identifies the whey protein powder entry as an outlier because of its high protein content (77.08 g) relative to its low Vitamin B content (2.45), which causes it to deviate significantly from the overall pattern of the data in the protein vs. Vitamin B scatterplot.

Sweet, now that we know that there were outliers in our dataset, let’s try to visually see which pair of values are being identified as outliers using a plot. Highlighting these outliers in a bright red color will make it super easy for us to identify them in our plot.

[28]:
# Setting Figure Size in Seaborn
sns.set(rc={'figure.figsize':(16,8)})

# Setting Seaborn plot style
sns.set_style("darkgrid")

#Plotting our data
plot = sns.regplot(x='Protein (g)', y='Vitamin_B', data=servings_cleaned.drop(
    outlier_df.index))

plt.scatter(outlier_df_cleaned.get('Protein (g)'),outlier_df_cleaned.get('Vitamin_B'))
plt.scatter(outlier_df_cleaned.get('Protein (g)'),outlier_df_cleaned.get('Vitamin_B'),
            facecolors='red',alpha=.35, s=500)

plt.show()
../_images/notebooks_cronometer_76_0.png

Thus, the points highlighted in red are ones that seem to not be following the general trend of our dataset. Lastly, let’s see what the new p-value is after outlier removal!

[29]:
slope, intercept, r_value, p_value, std_err = stats.linregress(
    servings_cleaned.drop(outlier_df.index).get('Protein (g)'),
     servings_cleaned.drop(outlier_df.index).get('Vitamin_B'))

print(f'Slope: {slope:.3g}')
print(f'Coefficient of determination: {r_value**2:.3g}')
print(f'p-value: {p_value:.3g}')
Slope: 0.409
Coefficient of determination: 0.675
p-value: 5.83e-24

Our new p-value after removing the outlier is 5.83e-24, which is still far below the 5% cutoff and smaller than the p-value obtained with the outlier included (1.73e-15). The coefficient of determination also rose from 0.5 to 0.675. The result therefore remains statistically significant, which means there is enough evidence to conclude that there is a correlation between protein content and Vitamin B content in a food item.
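
The effect of a single high-leverage point on a regression can be reproduced on synthetic data. The sketch below is only an illustration: the data are randomly generated, and the appended point is a hypothetical high-x, low-y observation similar in spirit to the protein powder entry.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic points that follow a clear linear trend
x = rng.uniform(0, 40, size=60)
y = 0.4 * x + rng.normal(0, 2, size=60)

# Append one hypothetical point with a very high x and a low y
x_all = np.append(x, 80.0)
y_all = np.append(y, 2.0)

with_outlier = stats.linregress(x_all, y_all)
without_outlier = stats.linregress(x, y)

print(f"with outlier:    slope={with_outlier.slope:.3f}, "
      f"R^2={with_outlier.rvalue**2:.3f}")
print(f"without outlier: slope={without_outlier.slope:.3f}, "
      f"R^2={without_outlier.rvalue**2:.3f}")
```

A point far from the rest along the x-axis pulls the fitted line toward itself, which flattens the slope and lowers the coefficient of determination, the same direction of change seen in the results above.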
