With a bunch of data about the best performances in different running events, I wanted to learn how to produce meaningful plots and insights with R and ggplot. The data is originally taken from http://www.alltime-athletics.com/men.htm, a website by Peter Larsson. I did a lot of postprocessing to be able to handle the data more easily. I also wanted to learn how to work with R and data frames. Most of the source code was taken from Stack Overflow or other sites, and the ggplot documentation was especially helpful. The data set, together with the source code that produces the figures below, can be downloaded from GitHub.
I consider data for the male running events: 100m, 200m, 400m, 800m, 1.500m, 5.000m, 10.000m and the marathon. For each event, there are hundreds to thousands of results, sorted by best performance; for example, the 1000 best marathon performances. Each performance comes with its rank as well as data like the name, date of birth and nationality of the athlete. Some typical lines in the original data file look like this:
1, 2:02:57 , Dennis Kimetto , KEN , 22.04.84 ,1, Berlin , 28.09.2014
2, 2:03:02 , Geoffrey Mutai , KEN , 07.10.81 ,1, Boston , 18.04.2011
3, 2:03:03 , Kenenisa Bekele , ETH , 13.06.82 ,1, Berlin , 25.09.2016
This means that Dennis Kimetto from Kenya ran the marathon world record in Berlin in 2014 with a time of 2:02:57 h. The second best performance all-time was by Geoffrey Mutai and the third best by Kenenisa Bekele. The data was obtained in July 2018, so there might be some changes in the future.
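To make the file format concrete, here is a minimal sketch of how such lines could be read into a data frame with base R. The column names (`rank`, `time`, `name`, `nation`, `born`, `place`, `city`, `date`) and the helper `to_seconds` are my own assumptions for illustration, not necessarily what the published source code uses.

```r
# The three raw lines quoted above, one string per line.
raw <- c(
  "1, 2:02:57 , Dennis Kimetto , KEN , 22.04.84 ,1, Berlin , 28.09.2014",
  "2, 2:03:02 , Geoffrey Mutai , KEN , 07.10.81 ,1, Boston , 18.04.2011",
  "3, 2:03:03 , Kenenisa Bekele , ETH , 13.06.82 ,1, Berlin , 25.09.2016"
)

# Column names are assumptions; the original file has no header line.
marathon <- read.csv(
  text = raw,
  header = FALSE,
  strip.white = TRUE,        # drop the spaces around each field
  quote = "",                # treat quote characters in names literally
  stringsAsFactors = FALSE,
  col.names = c("rank", "time", "name", "nation", "born", "place", "city", "date")
)

# Convert "h:mm:ss", "m:ss.xx" or "ss.xx" strings to seconds.
to_seconds <- function(x) {
  vapply(strsplit(x, ":", fixed = TRUE), function(parts) {
    parts <- as.numeric(parts)
    # the last part counts seconds, the one before minutes, then hours
    sum(parts * 60^rev(seq_along(parts) - 1))
  }, numeric(1))
}

marathon$seconds <- to_seconds(marathon$time)
marathon$date <- as.Date(marathon$date, format = "%d.%m.%Y")
```

Working in seconds keeps all events on one numeric scale, which is what the ratio plots further down need. The date of birth is left as text here because its two-digit year would need extra care when parsing.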
First, let’s look at the top 50 performances for each event, sorted by the date they were achieved. The top 10 performances are labeled with the athlete’s name. On the y-axis, we plot the ratio between any performance and the corresponding world record. That is, the 9.58s of Usain Bolt gets the y-value 1, and someone running 10.00s would get 10.00s/9.58s = 1.0438.
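The normalization can be sketched in a few lines of base R. This assumes a data frame with the hypothetical columns `event` and `seconds`; the 10.00s row is the illustrative time from the text, not an actual result, and the marathon rows are the two best times quoted earlier.

```r
# Add the ratio between each performance and the best time in its event.
add_ratio <- function(df) {
  # ave() returns, for every row, the minimum time within that row's event
  df$ratio <- df$seconds / ave(df$seconds, df$event, FUN = min)
  df
}

perf <- data.frame(
  event = c("100m", "100m", "marathon", "marathon"),
  seconds = c(9.58, 10.00,   # 10.00 is the illustrative time from the text
              7377, 7382),   # 2:02:57 and 2:03:02 in seconds
  stringsAsFactors = FALSE
)

perf <- add_ratio(perf)
round(perf$ratio, 4)
```

Because every event is divided by its own record, sprint and marathon results end up on the same dimensionless scale and can share one y-axis.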
We can see some interesting things here.
- Certain events are heavily dominated by a few athletes. This includes for example the 100m/200m by Usain Bolt, the 400m by Michael Johnson, the 800m by Wilson Kipketer and David Rudisha, very much the 1.500m by Hicham El Guerrouj, and the 5.000m/10.000m by Kenenisa Bekele and Haile Gebrselassie.
- Some events are ‘hip’, others are not. For example, most top performances for the 1.500m, 5.000m and 10.000m are rather old while the marathon has very many recent top performances.
- The marathon is dominated by Kenya and Ethiopia and the 100m is dominated by Jamaica and the USA.
- The top 50 performances for the marathon, the 1.500m and 800m have a smaller deviation than other events. By this I mean that the ratio between the 50th best performance and the best performance is relatively small. See also the next figure.
- The 9.58s of Usain Bolt is an insane outlier. The second best performance (which is also his) is a lot slower.
- Hicham El Guerrouj holds the oldest world record within the events considered.
For the 4th point, we can do a plot of the ratio over rank to visualize how it scales for different events. We again normalize with the corresponding world record and color code by event. The x-axis is log-scaled.
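A plot of this kind could be built roughly as follows with ggplot2 (the package needs to be installed). The data frame `toy` and its columns `event`, `rank` and `ratio` are assumptions, and the numbers are synthetic stand-ins, not real results.

```r
library(ggplot2)

# Synthetic stand-in data, NOT real results: the ratio grows linearly in log10(rank).
ranks <- c(1, 10, 100, 1000)
toy <- rbind(
  data.frame(event = "event A", rank = ranks, ratio = 1 + 0.010 * log10(ranks)),
  data.frame(event = "event B", rank = ranks, ratio = 1 + 0.020 * log10(ranks))
)

p <- ggplot(toy, aes(x = rank, y = ratio, colour = event)) +
  geom_line() +
  scale_x_log10() +   # log-scaled x-axis, as described above
  labs(x = "rank (log scale)", y = "time / world record time", colour = "event")

# print(p) renders the plot; ggsave("ratio_over_rank.png", p) would write it to a file.
```

Mapping `event` to `colour` gives one line per event with a shared legend, so the curves can be compared directly.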
One could argue that a smaller slope of the above curve makes an event more “elite” than other events, since e.g. going from rank 100 to rank 1000 comes with a smaller increase in the ratio than for an event with a larger slope. In such an event, we lose a lot of ranks by getting only a little bit slower, i.e. the top field is denser.
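One way to put a number on that slope is a straight-line fit of the ratio against log10(rank) for each event. This is only a sketch of that idea, not something taken from the original analysis; the helper `slope_per_event` is hypothetical and the data is synthetic.

```r
# Fit ratio ~ log10(rank) separately for each event and return the slopes.
slope_per_event <- function(df) {
  vapply(split(df, df$event), function(d) {
    unname(coef(lm(ratio ~ log10(rank), data = d))[2])
  }, numeric(1))
}

# Synthetic stand-in data, NOT real results.
ranks <- c(1, 10, 100, 1000)
toy <- rbind(
  data.frame(event = "event A", rank = ranks, ratio = 1 + 0.010 * log10(ranks)),
  data.frame(event = "event B", rank = ranks, ratio = 1 + 0.020 * log10(ranks))
)

slopes <- slope_per_event(toy)
slopes
```

With a log10 x-axis, the slope is the increase in ratio per tenfold step in rank. In this toy data, going from rank 100 to rank 1000 costs one percentage point in "event A" and two in "event B", so "event A" would be the denser field.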
On the GOATs
There are some candidates for the Greatest Of All Time (GOAT) in running, and whom to consider is a long discussion. I wanted to visualize some results for Gebrselassie, El Guerrouj, Bekele and Kipchoge. We will plot the performances that are within 10% of the corresponding world record time, colored by event and ordered by date.
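The selection step can be sketched like this, assuming a data frame with the hypothetical columns `name`, `event`, `seconds` and `date`. The first three rows are the marathon results quoted earlier; "Example Runner" is a made-up row that only shows the 10% cutoff at work.

```r
# Keep the performances of the given athletes that are within `tolerance`
# of the best time in the same event, ordered by date.
near_record <- function(df, athletes, tolerance = 0.10) {
  # best time per event, taken over ALL athletes in the data
  record <- ave(df$seconds, df$event, FUN = min)
  keep <- df$name %in% athletes & df$seconds <= (1 + tolerance) * record
  out <- df[keep, ]
  out[order(out$date), ]
}

perf <- data.frame(
  name = c("Dennis Kimetto", "Geoffrey Mutai", "Kenenisa Bekele", "Example Runner"),
  event = "marathon",
  seconds = c(7377, 7382, 7383, 8400),   # the 8400 s row is made up
  date = as.Date(c("2014-09-28", "2011-04-18", "2016-09-25", "2017-01-01")),
  stringsAsFactors = FALSE
)

selected <- near_record(perf, athletes = c("Kenenisa Bekele", "Example Runner"))
selected$name
```

The record is computed before filtering by athlete, so each performance is compared against the true best time in the data rather than against the best time among the selected runners.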
What do we see? El Guerrouj pretty much stuck to the 1.500m (except for his Olympic gold in the 5.000m…). Bekele and Gebrselassie competed in the 5.000m and 10.000m about equally often and during the same part of their careers. Bekele made a sharp cut in 2014, after which he competed in the marathon. Gebrselassie dropped the 5.000m in favor of the marathon but still ran some 10.000m races. Kipchoge dropped the 5.000m and everything else for the marathon and had significantly more success in that discipline (his Breaking 2 attempt is not part of this list).
It is also interesting to see that El Guerrouj delivered his performances at a pretty consistent time of the year, while all the others delivered top performances throughout the year.