Actually, it is just every finisher of the Tour de France between 1903 and 2017. It was quite a pain to gather all the data and had to be done manually to some extend. Therefore it is entirely possible that some errors were made. I will give a detailed description for the generation of the data set. All used scripts and the final data can be found on Github. The picture above shows that the Tour is getting shorter and faster.
Step 1 – getting all websites
Starting point for everything was the official Tour history site. Looking at the site source, we see something like
<div class="dateTabs__item js-tabs-parent"> <button class="dateTabs__link js-tabs" data-tabs-target="1904" data-xtclick="history::navTabs::item::link::1904" data-clicktype="A" data-tabs-ajax="/en/block/history/10708/0b76b8f809ad5d8bcf3579df597644d8">1904</a> </div> <div class="dateTabs__item js-tabs-parent"> <button class="dateTabs__link js-tabs" data-tabs-target="1905" data-xtclick="history::navTabs::item::link::1905" data-clicktype="A" data-tabs-ajax="/en/block/history/10709/c5f53ced72a23f333cc1866d8d2850e8">1905</a> </div> <div class="dateTabs__item js-tabs-parent"> <button class="dateTabs__link js-tabs" data-tabs-target="1906" data-xtclick="history::navTabs::item::link::1906" data-clicktype="A" data-tabs-ajax="/en/block/history/10710/ead1d1704b1600c795619e84ca511fc3">1906</a> </div>
I needed all endings like this
which are the domain endings for a specific year of the Tour. I used the regular expression
to gather all the domain endings in a file. Here is a preview of the file
/en/block/history/10707/d0ab6a216569236433268b7f19e1776c /en/block/history/10707/d0ab6a216569236433268b7f19e1776c /en/block/history/10708/0b76b8f809ad5d8bcf3579df597644d8 /en/block/history/10708/0b76b8f809ad5d8bcf3579df597644d8 /en/block/history/10709/c5f53ced72a23f333cc1866d8d2850e8 /en/block/history/10709/c5f53ced72a23f333cc1866d8d2850e8 /en/block/history/10710/ead1d1704b1600c795619e84ca511fc3
I.e. I have every domain ending twice but that does not matter. Now I used a simply Python script to dump every website in a file.
import os import subprocess prefix = 'letour.fr' id = 1 with open("domainendings.txt", "r") as ins: for line in ins: id += 1 if id%2==0: url = prefix+line output = 'raw/id_'+str(int(id/2))+'.txt' mycommand = 'w3m -dump -cols 1000 ' + url result = subprocess.check_output(mycommand, shell=True) file = open(output,"w") file.write(result) file.close
Step 2 – processing the output
This was hell. I do not like the script that I ended up with. Everything about this sucks but it had to be done. I struggled because of multiple reasons:
- the format that was used was not really consistent throughout the years
- cyclists have weird names
- teams have weird names
- names included thinks like: (,),’,”,*,… parsing was a mess
- the timing results of the 1997 and 2006 Tour were simply wrong
Anyway. I did something like
The script is in the repository. Don’t judge me, please…
Step 3 – the manual part
It was the second part of hell and probably the part were errors could have been made. I basically iterated the import process within R. I tried to read the csv file, it failed, I fixed the error, tried again, failed again, and so on. Problems included all of the above problems again. Most of the fixes were done with Vim.
Step 4 – the timings
Some times were not listed, I left them that way. The timings for the Tour 1997 and 2006 were just wrong, you can look for you self. I fixed them by using the “Gap” data and the time of the winners from the official Tour side. I used Matlab to come up with (hopefully) the correct timings.
Step 5 – the data frame
If you only care about the data frame this is four you. Some R code:
library(tidyverse) df <- read_csv("letour.csv", col_types = cols( year = col_integer(), rank = col_integer(), name = col_character(), id = col_integer(), team = col_character(), time = col_character(), h = col_integer(), m = col_integer(), s = col_integer() ) )
Here is how it looks.
> df # A tibble: 9,040 x 9 year rank name id team time h m s 1 1903 1 MAURICE GARIN 1 TDF 1903 94h 33m 14s 94 33 14 2 1903 2 LUCIEN POTHIER 37 TDF 1903 97h 32m 35s 97 32 35 3 1903 3 FERNAND AUGEREAU 39 TDF 1903 99h 02m 38s 99 2 38 4 1903 4 RODOLPHE MULLER 33 TDF 1903 99h 12m 44s 99 12 44
One important remark regarding the “time” field. For the 1997 and 2006 Tour, the “time” field is the original and wrong input that I took from the tour side while the fields “h”, “m” and “s” are the correct processed ones.
Additionally there is also a summary of every year of the Tour. This contains the year, number of stages, total km and the average speed. This even contains the average speed if there are no timing results for the riders in that year (e.g. the Tour 1905).
1903,6,2428,25.679 1904,6,2428,25.265 1905,11,2994,27.107 1906,13,4637,24.463 1907,14,4488,28.47 1908,14,4488,28.74 1909,14,4497,28.658 1910,15,4734,29.099 1911,15,5343,27.322
The whole procedure is only partially reproducible since I fixed many errors “manually”. This is obviously not good. However, I simply struggled to catch every problematic case within the Python script so I did it manually.
Timings are partially not included and for the 1997 and 2006 Tour I had to fix it. Someone might want to double check that.
While the official Tour site contains information like the “Gap” to the winner and two more categories “B” and “P” (bonus and ?), I did not include them in the final file. This is because I simply struggled with the way the different years were formatted. It was less error prone to simply exclude the data. However the gap can easily be calculated from the data.
Remarks 2: Time vs Score
There is an error on the Tour website where instead of the timings, the score was stored. See this quote from the Github repository
The following problems are known:
1905 – 0
1906 – 0
1907 – p
1908 – 0
1909 – p
1910 – p
1911 – p
1912 – p
Where 0 means that all times are zero and p means that the column for the timing actually contained the overall points and not the timings (this was an error on the LeTour.fr website).
Thus, the years 1905-1912 do not seem trustworthy. I am not yet sure how to fix this. However, also Wikipedia lists not the winning times but the winning scores for the years 1905-1912, see here.
Remarks 3: Further postprecessing
I corrected the problem in Remark 2 by adding categories to the data that specify whether we see timing results or scores. I also added the average speed of the cyclists, which sometimes happen to match the “Avg. Pace” given on the Tour website (i.e. the Tour website mentions the pace for the winner), but not always as you can check yourself with the 1981 Tour and the data on the Tour website. Again this is something that is not consistent within the Tour website.
Note furthermore that this data set might be updated in the future and I will not add any more remarks. Everything will be mentioned on Github though.