Running distance and pace distribution with R

Getting the data – again

As in this post, I gathered my running data again. This time, however, I bulk-exported everything from Polar instead of Garmin (an app called SyncMyTracks synchronizes the different services).

I did so using the tool polar-flow-export from GitHub. I then wrote a Python script that extracts the relevant data from the resulting .tcx files into a single CSV file. The script needs python-tcxparser, which can be installed via

pip install python-tcxparser

The Python script can then be called like

python3 main.py ../out/

where the ../out/ argument is the folder to which I exported all the .tcx files in the first step. The script writes output.csv into that same folder. Here is the main.py file:


import csv
import glob
import os
import sys

import tcxparser


def main():
    # Folder with the exported .tcx files; output.csv is written there as well.
    myDir = sys.argv[1]
    os.chdir(myDir)
    numberFilesImported = 0
    numberFilesFailed = 0

    with open('output.csv', 'w', newline='') as out:
        writer = csv.writer(out)
        for file in glob.glob("*.tcx"):
            print(file)
            try:
                tcx = tcxparser.TCXParser(file)

                # tcx.pace is a string of the form "MM:SS"; convert it to seconds per km.
                pace = int(tcx.pace[0:2])*60 + int(tcx.pace[3:5])
                # Split the completion timestamp into date and time (dropping fractional seconds).
                tmp = tcx.completed_at.split("T")
                date = tmp[0]
                time = tmp[1].split(".")
                data = [date,time[0],tcx.activity_type,tcx.distance, tcx.duration,tcx.pace,pace,tcx.latitude,tcx.longitude]#,tcx.hr_avg,tcx.hr_min,tcx.hr_min]
                #data = [tcx.activity,tcx.activity_type,tcx.distance, tcx.duration,tcx.pace,tcx.ascent,tcx.descent,tcx.latitude,tcx.longitude,tcx.hr_avg,tcx.hr_min,tcx.hr_min]
                print(data)
                writer.writerow(data)
                out.flush()
                # Count the file only once its row has actually been written.
                numberFilesImported += 1

            except Exception:
                print("\t!!!!!! FILE COULD NOT BE PARSED !!!!!!")
                numberFilesFailed += 1

    print(f'Imported {numberFilesImported} files')
    print(f'Failed to import {numberFilesFailed} files')


if __name__ == "__main__":
    main()

Not pretty but it does the job. My resulting CSV file starts as follows:

2012-04-16,10:24:35,running,13680.0,4192.0,05:06,306,50.78088236,6.09665447
2012-04-18,10:47:02,running,7238.54607310699,2115.0,04:52,292,50.78210075,6.09782727
2012-04-22,14:09:09,running,28535.796184113904,10571.0,06:10,370,50.77468025,6.09629815
2012-05-05,13:11:48,running,16168.325999543544,7308.0,07:31,451,50.74850298,6.08372496
2012-05-06,13:39:53,running,25033.0,9180.0,06:06,366,50.77482007,6.09631483
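
As a quick sanity check on these rows, the pace column should be reproducible from distance and duration. The following is a minimal Python sketch, assuming the pace is simply duration divided by distance in km, truncated to whole seconds (an assumption that these five rows are consistent with, not a documented guarantee). The helper names pace_to_seconds and pace_from_row are made up for this example.

```python
import csv
import io

# The first rows of output.csv, copied from the listing above.
SAMPLE = """\
2012-04-16,10:24:35,running,13680.0,4192.0,05:06,306,50.78088236,6.09665447
2012-04-18,10:47:02,running,7238.54607310699,2115.0,04:52,292,50.78210075,6.09782727
2012-04-22,14:09:09,running,28535.796184113904,10571.0,06:10,370,50.77468025,6.09629815
2012-05-05,13:11:48,running,16168.325999543544,7308.0,07:31,451,50.74850298,6.08372496
2012-05-06,13:39:53,running,25033.0,9180.0,06:06,366,50.77482007,6.09631483
"""


def pace_to_seconds(pace):
    """Convert an "MM:SS" pace string to seconds per km."""
    minutes, seconds = pace.split(":")
    return int(minutes) * 60 + int(seconds)


def pace_from_row(distance_m, duration_s):
    """Seconds per km, truncated to whole seconds (assumed convention)."""
    return int(duration_s / (distance_m / 1000))


rows = list(csv.reader(io.StringIO(SAMPLE)))
for date, _time, _type, distance, duration, pace, pace_sec, _lat, _lon in rows:
    recomputed = pace_from_row(float(distance), float(duration))
    # Columns: date, pace string, pace string in seconds, stored seconds, recomputed seconds
    print(date, pace, pace_to_seconds(pace), int(pace_sec), recomputed)
```

If the last three numbers on a line disagree, either the pace string was parsed incorrectly or the truncation assumption does not hold for that file.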

Visualizing with R

We load the data, keep only the running activities, and bin the pace into categories (easy, slow, normal, fast, very fast, …) as well as the distance into 1 km bins.


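library(tidyverse)  # also attaches ggplot2 and dplyr
library(ggpubr)     # provides theme_pubr(), which the plots use

df <- read_csv("out/output.csv",
               col_names = c("date","time","type","distance","duration","paceInMin","paceInSec","latitude","longitude"))
dfrunning <- df[df$type=="running",]

# Pace bins in sec/km: (0,180], (180,210], ..., (330,360], (360,10000]
catpace <- cut(dfrunning$paceInSec,c(0,seq(180,360,30),10000))
dfrunning$catpace <- catpace

# 1 km distance bins; the extra 1000 m ensures the longest run falls inside the last bin
catdistance <- cut(dfrunning$distance,c(seq(0,max(dfrunning$distance)+1000,1000)))
dfrunning$catdistance <- catdistance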
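
For readers more at home in Python, here is a rough sketch of what the cut() call for the pace does. By default, cut() builds right-closed intervals, so a pace of exactly 300 sec/km lands in (270,300] and not in (300,330]. The function name pace_category is made up for this example, and the labels are formatted my own way rather than exactly as R prints them.

```python
from bisect import bisect_left

# Same break points as c(0, seq(180, 360, 30), 10000) in the R code.
BREAKS = [0] + list(range(180, 361, 30)) + [10000]


def pace_category(pace_sec):
    """Return the right-closed interval (lo,hi] containing pace_sec, or None."""
    if not BREAKS[0] < pace_sec <= BREAKS[-1]:
        return None  # R's cut() would return NA here
    # bisect_left finds the first break >= pace_sec, i.e. the upper bound.
    i = bisect_left(BREAKS, pace_sec)
    return f"({BREAKS[i - 1]},{BREAKS[i]}]"


for pace in [306, 292, 370, 451, 366, 300]:
    print(pace, pace_category(pace))
```

Everything slower than 360 sec/km ends up in one catch-all bin, which is worth remembering when reading the plots.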

First, I want to see which distances I usually run, with each bar colored by pace category.

ggplot(dfrunning,aes(x=distance/1000))+
  geom_histogram(aes(fill=catpace),binwidth=1)+
  # after_stat(count) replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
  stat_bin(binwidth=1, geom="text", colour="white", size=3.5,
           aes(label=after_stat(count), group=catpace), position=position_stack(vjust=0.5)) +
  # the x axis is in km, so the breaks must be in km as well
  scale_x_continuous(breaks=seq(0,max(dfrunning$distance)/1000, 1))+
  labs(title = "Running distribution", x = "Distance in km", y = "Count", fill = "Pace in sec/km") +
  theme_pubr()

 

So not much of a surprise. I mostly run shorter distances, and there are peaks at 5, 10, 13 and 15 km that I can explain (routes of that length that I run frequently) and one peak at 3 km that I am not too sure about. Pace-wise, I think I have been running too slowly. But maybe the next plot helps: I wanted to rescale every bar to 100% and show the pace distribution within each distance accordingly. I needed help from StackOverflow to produce the next figure.


library("ggpubr")
dfrunning %>%
  select(distance, catpace) %>%
  mutate(dist = round(distance/1000)) %>%
  # number of runs per (distance, pace category) combination
  group_by(dist, catpace) %>%
  mutate(test = n()) %>%
  distinct(dist, catpace, test) %>%
  # share of each pace category within its distance
  group_by(dist) %>%
  mutate(pct = test/sum(test)*100) %>%
  ggplot(aes(x= dist, y = pct)) +
  geom_bar(aes(fill=catpace), stat = "identity") +
  geom_text(aes(group=catpace,label = paste0(round(pct, 0),"%")),
            colour="white", size=3.5, angle = 90,
            position = position_stack(vjust = 0.5)) +
  labs(title = "Running distribution",
       x = "Distance in km", y = "Percentage",
       fill = "Pace in sec/km")+
  theme_pubr()
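
The rescaling step in this pipeline can be sketched in a few lines of plain Python: count the runs per (rounded km, pace category) pair, then divide by the number of runs at that distance. The runs below are made up purely for illustration, and pace_shares is a hypothetical helper name, not part of any library.

```python
from collections import Counter

# Made-up (distance in m, pace category) pairs, purely for illustration.
runs = [
    (5100, "(300,330]"),
    (4900, "(300,330]"),
    (5200, "(270,300]"),
    (10050, "(300,330]"),
    (9800, "(330,360]"),
]


def pace_shares(runs):
    """Map (km, category) to that category's percentage among all runs of that km."""
    per_cell = Counter((round(d / 1000), cat) for d, cat in runs)
    per_km = Counter(round(d / 1000) for d, _ in runs)
    return {(km, cat): 100 * n / per_km[km] for (km, cat), n in per_cell.items()}


shares = pace_shares(runs)
for (km, cat), pct in sorted(shares.items()):
    print(f"{km} km  {cat}: {pct:.0f}%")
```

Within each distance the percentages add up to 100, which is why every bar in the figure has the same height.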

What I see here is perhaps a lack of fast runs at distances longer than ~10 km (the brown part). Also, longer runs were mostly run at around 5 min/km. However, I am not really getting a lot out of this, so I will try to convince some of my friends to give me their data and make some comparisons. Maybe I also need to adjust the pace groups.

Furthermore, I would very much appreciate ideas on what to analyze. Obviously this was just a teaser and more details will follow. Mostly this was me gathering the data and learning a bit more about R and ggplot2.
