Running distance and pace distribution with R

Getting the data – again

As in this post, I gathered my data again. This time, however, I bulk-exported everything from Polar instead of Garmin (there is an app called SyncMyTracks that synchronizes different services).

I did so using the tool polar-flow-export from GitHub. I then wrote a Python script that extracts all the relevant data from the resulting .tcx files into a single CSV file. For that I needed python-tcxparser, which can be installed via

pip install python-tcxparser

The Python script can then be called like

python3 ../out/

where the ../out/ argument is the folder to which I exported all the .tcx files in the first step. Here is the file:

import tcxparser
import glob
import os
import sys
import csv

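def main():
    # Folder with the exported .tcx files (first command-line argument)
    myDir = sys.argv[1]
    numberFilesImported = 0
    numberFilesFailed = 0

    # Work inside the export folder, so output.csv ends up next to the .tcx files
    # (assumption: this step was lost from the listing; it explains the unused "import os")
    os.chdir(myDir)

    with open('output.csv', 'w', newline='') as out:
        writer = csv.writer(out)
        for file in glob.glob("*.tcx"):
            try:
                tcx = tcxparser.TCXParser(file)

                # Convert the "MM:SS" pace string to seconds per km
                pace = int(tcx.pace[0:2])*60 + int(tcx.pace[3:5])
                # completed_at is an ISO timestamp: split into date and time
                tmp = tcx.completed_at.split("T")
                date = tmp[0]
                time = tmp[1].split(".")
                data = [date, time[0], tcx.activity_type, tcx.distance, tcx.duration, tcx.pace, pace, tcx.latitude, tcx.longitude]#,tcx.hr_avg,tcx.hr_min,tcx.hr_min]
                #data = [tcx.activity,tcx.activity_type,tcx.distance, tcx.duration,tcx.pace,tcx.ascent,tcx.descent,tcx.latitude,tcx.longitude,tcx.hr_avg,tcx.hr_min,tcx.hr_min]
                writer.writerow(data)
                numberFilesImported += 1
            except Exception:
                print("\t!!!!!! FILE COULD NOT BE PARSED !!!!!!")
                numberFilesFailed += 1

    print(f'Imported {numberFilesImported} files')
    print(f'Failed to import {numberFilesFailed} files')

if __name__ == "__main__":
    main()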

Not pretty but it does the job. My resulting CSV file starts as follows:


Visualizing with R

We load the data, filter the running activities and add categories for paces (easy, slow, normal, fast, very fast, …).


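library(tidyverse)  # provides read_csv, the dplyr verbs and ggplot2

df <- read_csv("out/output.csv",
               col_names = c("date", "time", "type", "distance", "duration",
                             "paceInMin", "paceInSec", "latitude", "longitude"))
dfrunning <- df[df$type == "running", ]

# Pace categories in sec/km: up to 180, then 30-second steps up to 360, then everything slower
catpace <- cut(dfrunning$paceInSec, c(0, seq(180, 360, 30), 10000))
dfrunning$catpace <- catpace

# Distance categories in 1000 m steps
catdistance <- cut(dfrunning$distance, c(seq(0, max(dfrunning$distance), 1000)))
dfrunning$catdistance <- catdistance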
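
For readers more at home in Python, here is a minimal sketch of what the pace binning above does. It assumes the default behaviour of R's cut(), where every interval is open on the left and closed on the right, and values outside the breaks become NA. The function name pace_category and the label format are made up for illustration and are not part of the original script.

```python
from bisect import bisect_left

# Same break points as c(0, seq(180, 360, 30), 10000) in the R code
BREAKS = [0] + list(range(180, 361, 30)) + [10000]


def pace_category(pace_sec, breaks=BREAKS):
    """Return a label such as '(270,300]' for a pace in sec/km.

    Returns None if the value lies outside the breaks, which is where
    R's cut() would yield NA.
    """
    # bisect_left finds the first break >= pace_sec, i.e. the closed right edge
    i = bisect_left(breaks, pace_sec)
    if i == 0 or i == len(breaks):
        return None
    return f"({breaks[i - 1]},{breaks[i]}]"


if __name__ == "__main__":
    for pace in (175, 300, 301, 415):
        print(pace, pace_category(pace))
```

A pace of exactly 300 sec/km (5:00 min/km) therefore still counts towards the (270,300] group, while 301 already falls into the next one.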

First, I want to see which distances I usually run, color-coded by pace.

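# The opening lines of this plot were lost; the first two lines are a reconstruction (assumption)
ggplot(dfrunning, aes(x = distance/1000, fill = catpace)) +
  geom_histogram(binwidth = 1) +
  stat_bin(binwidth = 1, geom = "text", colour = "white", size = 3.5,
           aes(label = ..count.., group = catpace), position = position_stack(vjust = 0.5)) +
  scale_x_continuous(breaks = seq(0, max(dfrunning$distance)/1000, 1)) +
  labs(title = "Running distribution", x = "Distance in km", y = "Count", fill = "Pace in sec/km")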


So not much of a surprise. I mostly run shorter distances. There are peaks at 5, 10, 13 and 15 km, which I can explain (routes of that length that I ran frequently), and one peak at 3 km that I am not too sure about. Pace-wise, I guess I have been running too slowly. But maybe the next plot helps. I wanted to rescale every bar to 100% and adjust the pace distribution accordingly. I needed help from StackOverflow to produce the next figure.

dfrunning %>%
  select(distance, catpace) %>%
  mutate(dist = round(distance/1000)) %>%   # distance rounded to whole km
  group_by(dist, catpace) %>%
  mutate(test = n()) %>%                    # runs per distance and pace group
  distinct(dist, catpace, test) %>%
  group_by(dist) %>%
  mutate(pct = test/sum(test)*100) %>%      # share of each pace group within a distance
  ggplot(aes(x = dist, y = pct)) +
  geom_bar(aes(fill = catpace), stat = "identity") +
  geom_text(aes(group = catpace, label = paste0(round(pct, 0), "%")),
            colour = "white", size = 3.5, angle = 90,
            position = position_stack(vjust = 0.5)) +
  labs(title = "Running distribution",
       x = "Distance in km", y = "Percentage",
       fill = "Pace in sec/km")
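
As a cross-check on the pipeline above, the rescaling can also be sketched in plain Python: count the runs per (distance, pace group) pair, then divide by the total number of runs at that distance. The function name pace_share_per_distance and the sample runs are made up for illustration; they are not my real data.

```python
from collections import Counter, defaultdict


def pace_share_per_distance(runs):
    """Rescale pace-group counts to percentages within each distance.

    runs: iterable of (distance_in_m, pace_category) pairs.
    Returns {distance_km: {pace_category: percentage}}.
    """
    # Count runs per (distance rounded to whole km, pace group)
    counts = Counter((round(dist_m / 1000), cat) for dist_m, cat in runs)

    # Total number of runs per distance
    totals = defaultdict(int)
    for (km, _cat), n in counts.items():
        totals[km] += n

    # Share of each pace group within its distance, in percent
    shares = defaultdict(dict)
    for (km, cat), n in counts.items():
        shares[km][cat] = n / totals[km] * 100
    return dict(shares)


if __name__ == "__main__":
    # Hypothetical sample: four runs of about 5 km, one of about 10 km
    sample = [
        (5100, "(270,300]"), (4900, "(270,300]"),
        (5200, "(300,330]"), (4800, "(300,330]"),
        (10050, "(300,330]"),
    ]
    print(pace_share_per_distance(sample))
```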

What I see here is maybe a lack of quick runs for distances longer than ~10 km (the brown part). Also, longer runs were mostly dominated by a pace of about 5 min/km. However, I am not really getting a lot out of this, so I will try to convince some of my friends to give me their data and make some comparisons. Maybe I also need to adjust the pace groups.

Furthermore, I would very much appreciate ideas on what to analyze. Obviously this was just a teaser and more details will follow. Mostly this was me gathering the data and learning a bit more about R and ggplot2.
