Every cyclist of the Tour de France in a single CSV file

Actually, it is just every finisher of the Tour de France between 1903 and 2017. It was quite a pain to gather all the data and had to be done manually to some extend. Therefore it is entirely possible that some errors were made. I will give a detailed description for the generation of the data set. All used scripts and the final data can be found on Github.  The picture above shows that the Tour is getting shorter and faster.

Step 1 – getting all websites

Starting point for everything was the official Tour history site. Looking at the site source, we see something like


                                                                                                                                                          <div class="dateTabs__item js-tabs-parent">
                                        <button class="dateTabs__link js-tabs"
                                                data-tabs-target="1904"
                                                data-xtclick="history::navTabs::item::link::1904"
                                                data-clicktype="A"
                                                data-tabs-ajax="/en/block/history/10708/0b76b8f809ad5d8bcf3579df597644d8">1904</a>
                                    </div>
                                                                                                                                                                                                        <div class="dateTabs__item js-tabs-parent">
                                        <button class="dateTabs__link js-tabs"
                                                data-tabs-target="1905"
                                                data-xtclick="history::navTabs::item::link::1905"
                                                data-clicktype="A"
                                                data-tabs-ajax="/en/block/history/10709/c5f53ced72a23f333cc1866d8d2850e8">1905</a>
                                    </div>
                                                                                                                                                                                                        <div class="dateTabs__item js-tabs-parent">
                                        <button class="dateTabs__link js-tabs"
                                                data-tabs-target="1906"
                                                data-xtclick="history::navTabs::item::link::1906"
                                                data-clicktype="A"
                                                data-tabs-ajax="/en/block/history/10710/ead1d1704b1600c795619e84ca511fc3">1906</a>
                                    </div>

I needed all endings like this

/en/block/history/10707/d0ab6a216569236433268b7f19e1776c

which are the domain endings for a specific year of the Tour. I used the regular expression

(?<=data-tabs-ajax=")([\s\S]*?)(?=") 

to gather all the domain endings in a file. Here is a preview of the file

/en/block/history/10707/d0ab6a216569236433268b7f19e1776c
/en/block/history/10707/d0ab6a216569236433268b7f19e1776c
/en/block/history/10708/0b76b8f809ad5d8bcf3579df597644d8
/en/block/history/10708/0b76b8f809ad5d8bcf3579df597644d8
/en/block/history/10709/c5f53ced72a23f333cc1866d8d2850e8
/en/block/history/10709/c5f53ced72a23f333cc1866d8d2850e8
/en/block/history/10710/ead1d1704b1600c795619e84ca511fc3

I.e. I have every domain ending twice but that does not matter. Now I used a simply Python script to dump every website in a file.

 
import os
import subprocess
prefix = 'letour.fr'

id = 1
with open("domainendings.txt", "r") as ins:
    for line in ins:
        id += 1
        if id%2==0:
            url = prefix+line
            output = 'raw/id_'+str(int(id/2))+'.txt'
            mycommand = 'w3m -dump -cols 1000 ' + url
            result = subprocess.check_output(mycommand, shell=True)
            
            file = open(output,"w")
            file.write(result)
            file.close

Step 2 – processing the output

This was hell. I do not like the script that I ended up with. Everything about this sucks but it had to be done. I struggled because of multiple reasons:

  • the format that was used was not really consistent throughout the years
  • cyclists have weird names
  • teams have weird names
  • names included thinks like: (,),’,”,*,… parsing was a mess
  • the timing results of the 1997 and 2006 Tour were simply wrong

Anyway. I did something like

python3 postprocess.py>>summary.csv

The script is in the repository. Don’t judge me, please…

Step 3 – the manual part

It was the second part of hell and probably the part were errors could have been made. I basically iterated the import process within R. I tried to read the csv file, it failed, I fixed the error, tried again, failed again, and so on. Problems included all of the above problems again. Most of the fixes were done with Vim.

Step 4 – the timings

Some times were not listed, I left them that way. The timings for the Tour 1997 and 2006 were just wrong, you can look for you self. I fixed them by using the “Gap” data and the time of the winners from the official Tour side. I used Matlab to come up with (hopefully) the correct timings.

Step 5 – the data frame

If you only care about the data frame this is four you. Some R code:


library(tidyverse)
df <- read_csv("letour.csv", 
               col_types = cols(
                 year = col_integer(),
                 rank = col_integer(),
                 name = col_character(),
                 id = col_integer(),
                 team = col_character(),
                 time = col_character(),
                 h = col_integer(),
                 m = col_integer(),
                 s = col_integer()
               )
)

Here is how it looks.


> df
# A tibble: 9,040 x 9
    year  rank name                         id team     time             h     m     s
 1  1903     1 MAURICE GARIN                 1 TDF 1903 94h 33m 14s     94    33    14
 2  1903     2 LUCIEN POTHIER               37 TDF 1903 97h 32m 35s     97    32    35
 3  1903     3 FERNAND AUGEREAU             39 TDF 1903 99h 02m 38s     99     2    38
 4  1903     4 RODOLPHE MULLER              33 TDF 1903 99h 12m 44s     99    12    44

One important remark regarding the “time” field. For the 1997 and 2006 Tour, the “time” field is the original and wrong input that I took from the tour side while the fields “h”, “m” and “s” are the correct processed ones.

Additionally there is also a summary of every year of the Tour. This contains the year, number of stages,  total km and the average speed. This even contains the average speed if there are no timing results for the riders in that year (e.g. the Tour 1905).


1903,6,2428,25.679
1904,6,2428,25.265
1905,11,2994,27.107
1906,13,4637,24.463
1907,14,4488,28.47
1908,14,4488,28.74
1909,14,4497,28.658
1910,15,4734,29.099
1911,15,5343,27.322

 

Remarks

The whole procedure is only partially reproducible since I fixed many errors “manually”. This is obviously not good. However, I simply struggled to catch every problematic case within the Python script so I did it manually.

Timings are partially not included and for the 1997 and 2006 Tour I had to fix it. Someone might want to double check that.

While the official Tour site contains information like the “Gap” to the winner and two more categories “B” and “P” (bonus and ?), I did not include them in the final file. This is because I simply struggled with the way the different years were formatted. It was less error prone to simply exclude the data. However the gap can easily be calculated from the data.

Remarks 2:  Time vs Score

There is an error on the Tour website where instead of the timings, the score was stored. See this quote from the Github repository

The following problems are known:
1905 – 0
1906 – 0
1907 – p
1908 – 0
1909 – p
1910 – p
1911 – p
1912 – p
Where 0 means that all times are zero and p means that the column for the timing actually contained the overall points and not the timings (this was an error on the LeTour.fr website).

Thus, the years 1905-1912 do not seem trustworthy. I am not yet sure how to fix this. However, also Wikipedia lists not the winning times but the winning scores for the years 1905-1912, see here.

Remarks 3: Further postprecessing

I corrected the problem in Remark 2 by adding categories to the data that specify whether we see timing results or scores. I also added the average speed of the cyclists, which sometimes happen to match the “Avg. Pace” given on the Tour website (i.e. the Tour website mentions the pace for the winner), but not always as you can check yourself with the 1981 Tour and the data on the Tour website. Again this is something that is not consistent within the Tour website.
Note furthermore that this data set might be updated in the future and I will not add any more remarks. Everything will be mentioned on Github though.

Leave a Reply