Using machine learning to remove Cars from Run and Ride leaderboards

How does the ‘Cars on Segments’ Machine Learning Model work?
Strava’s Cars on Segments Model helps identify if any part of any activity uploaded to Strava was recorded in a car, motorcycle, train, or even plane. Every single run and ride that is uploaded to Strava passes through this model. The model receives the activity, evaluates it, then sends our systems back a recommendation.
Any recommendation the model returns is on a scale of 0 to 1 and is called a probability.
You can think about it like so: if the model returns 0.9999, it believes there’s a 99.99% probability that a vehicle is present at some point during the activity. If the probability is greater than a certain level (called a classification threshold), we flag the activity before it reaches any leaderboards, and the user is prompted to crop out the vehicle portion or make the activity private.
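The thresholding step can be sketched in a few lines of Python. This is only an illustration: the threshold value of 0.9 and the names `review_activity` and `Decision` are assumptions made for the example, since the article does not state the real threshold or how Strava's systems are structured.

```python
from dataclasses import dataclass

# Hypothetical threshold; the real value is not given in this article.
CLASSIFICATION_THRESHOLD = 0.9


@dataclass
class Decision:
    flagged: bool
    probability: float


def review_activity(vehicle_probability: float,
                    threshold: float = CLASSIFICATION_THRESHOLD) -> Decision:
    """Flag an activity when the vehicle probability exceeds the threshold."""
    if not 0.0 <= vehicle_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return Decision(flagged=vehicle_probability > threshold,
                    probability=vehicle_probability)


if __name__ == "__main__":
    print(review_activity(0.9999))  # above the assumed threshold, so flagged
    print(review_activity(0.12))    # below the assumed threshold, so not flagged
```

Raising the threshold would flag fewer genuine rides by mistake at the cost of letting more vehicle activities through, which is the usual trade-off with any classification threshold.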
How does our model determine this probability?
It’s actually pretty simple. The model starts by calculating a series of “features” from the underlying activity data as soon as the activity has finished uploading. We’ve developed 57 features that this model uses to differentiate cars from bikes. Some of these are simple calculations like averages and variances of velocity or acceleration. There are also more complicated features like “jerk,” which is the derivative of acceleration, or “average VAM on climbs,” which is a cycling-specific metric based on the limits of human performance that are documented on Strava. We’ve also considered the effect of varying gradients as well as momentum.
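To make the idea of a feature concrete, here is a minimal sketch that derives a handful of them from a stream of speed samples. The function names, the one-second sampling interval, and the choice of features are assumptions for illustration; they are not the 57 features the model actually uses. VAM is computed with its standard definition, vertical meters climbed per hour.

```python
from statistics import mean, pvariance


def derivative(values, dt):
    """Finite-difference derivative of evenly spaced samples."""
    return [(b - a) / dt for a, b in zip(values, values[1:])]


def basic_features(speeds_mps, dt=1.0):
    """Compute a few illustrative features from a speed stream in m/s.

    Needs at least three samples so that jerk can be computed.
    """
    accel = derivative(speeds_mps, dt)   # change in speed per second
    jerk = derivative(accel, dt)         # change in acceleration per second
    return {
        "mean_speed": mean(speeds_mps),
        "speed_variance": pvariance(speeds_mps),
        "max_speed": max(speeds_mps),
        "mean_abs_accel": mean([abs(a) for a in accel]),
        "max_abs_jerk": max([abs(j) for j in jerk]),
    }


def vam_m_per_hour(elevation_gain_m, duration_s):
    """VAM: vertical meters ascended per hour of climbing."""
    return elevation_gain_m / (duration_s / 3600.0)


if __name__ == "__main__":
    print(basic_features([0.0, 2.0, 4.0, 6.0]))
    print(vam_m_per_hour(500.0, 1800.0))
```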
One of our favorite features, called the “Sendrix Coefficient,” deserves special mention. This was developed in-house with one of Strava’s fastest staff cyclists, who rides under the name Jimi Sendrix. We asked him how fast he could accelerate from a dead stop to 20 mph, and how many times he could do this on a ride before exhausting himself. We coded this into the model and it proved to be a great feature in helping the model differentiate cars from bikes. Even elite cyclists get tired and accelerate more slowly over the course of a ride, but cars never get tired and continue to accelerate quickly.
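The article does not give the formula behind the Sendrix Coefficient, so the following is only a guess at the general shape of such a feature: find every acceleration from a stop to 20 mph and count how many were completed quickly. The 8-second limit, the 0.5 m/s "stopped" cutoff, and the function names are all hypothetical.

```python
MPH_TO_MPS = 0.44704
TARGET_SPEED = 20 * MPH_TO_MPS   # 20 mph expressed in m/s
STOP_SPEED = 0.5                 # assumed: at or below this counts as stopped
FAST_LAUNCH_SECONDS = 8.0        # hypothetical limit, not Strava's value


def launch_times(speeds_mps, dt=1.0):
    """Return the duration of each stop-to-20-mph acceleration in the stream."""
    times = []
    stop_index = None
    for i, v in enumerate(speeds_mps):
        if v <= STOP_SPEED:
            stop_index = i                       # remember the latest stop
        elif v >= TARGET_SPEED and stop_index is not None:
            times.append((i - stop_index) * dt)  # time from stop to 20 mph
            stop_index = None                    # wait for the next stop
    return times


def count_fast_launches(speeds_mps, dt=1.0, limit=FAST_LAUNCH_SECONDS):
    """Count launches that reached 20 mph within the assumed time limit."""
    return sum(1 for t in launch_times(speeds_mps, dt) if t <= limit)


if __name__ == "__main__":
    stream = [0, 3, 6, 9, 12, 0, 5, 10]  # speeds in m/s, one sample per second
    print(launch_times(stream))
    print(count_fast_launches(stream))
```

A rider would be expected to produce only a limited number of fast launches, with times that lengthen as the ride goes on, while a car can keep producing them indefinitely.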
For each of these features, we use SHAP values to explain how the model used it. These values weight the data from each feature more towards cars or more towards bikes. There might be some overlap between cars and bikes on many features, but when you start looking at all these features as a whole, the differences between vehicles and bikes become clearer.
Here’s an example: If the top speed on a bike activity was 80 mph, the model would weight that feature heavily towards “car” since it’s nearly impossible to go that fast on a bike. If the top speed was 25 mph, that may not inform the model as much since bikes can go that fast and some car trips in urban areas might never break 25 mph. In this case, the model will reference other features. It will score each activity using all features, then add up the scores to determine its probability of being in a vehicle on the 0 to 1 scale.
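The "add up the scores" step can be sketched as follows. The article does not say which scale the per-feature scores live on; this example assumes they are additive contributions in log-odds space, which is how SHAP values are commonly reported for tree-based binary classifiers, and converts the total to a probability with the logistic function. The contribution numbers and feature names are made up.

```python
import math


def probability_from_contributions(base_value, contributions):
    """Sum per-feature contributions (log-odds) and map the total to 0..1."""
    margin = base_value + sum(contributions.values())
    return 1.0 / (1.0 + math.exp(-margin))


if __name__ == "__main__":
    # Made-up contributions: positive pushes towards "car", negative towards "bike".
    obvious_car = {"top_speed": 4.0, "mean_accel": 1.5, "jerk": 0.5}
    ambiguous = {"top_speed": 0.1, "mean_accel": -0.4, "jerk": 0.2}
    print(probability_from_contributions(-1.0, obvious_car))
    print(probability_from_contributions(-1.0, ambiguous))
```

In the ambiguous case no single feature dominates, so the result depends on the combined effect of many small contributions, which matches the 25 mph example above.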
Sometimes it’s easy to differentiate a bike from a car:
Sometimes it’s hard to differentiate between a bike and a car:
How we trained our model
We trained a gradient-boosted decision tree classifier using XGBoost, a widely used and effective open-source machine learning library. We trained our model on tens of thousands of Strava activities that we’ve identified as containing a vehicle or being completely recorded in a vehicle.
Here’s a visual representation of how these decision trees work:
Some features matter more than others, but it’s the sum of all these small decisions that helps our model make its recommendation. Based on our testing, our model can identify as many as 81% of activities containing vehicles uploaded to Strava and flag them before they take your QOM or local legend.
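As a rough picture of how many small decisions combine, here is a toy ensemble of three hand-written trees. Each tree looks at a feature, returns a small positive or negative score, and the scores are summed and converted to a probability. The thresholds, leaf values, and feature names are invented for the example; a real gradient-boosted model learns them from training data and typically has far more trees.

```python
import math


def tree_top_speed(f):
    """Toy tree: very high top speeds push strongly towards 'vehicle'."""
    if f["max_speed_mph"] > 60:
        return 2.0
    if f["max_speed_mph"] > 35:
        return 0.4
    return -0.5


def tree_launches(f):
    """Toy tree: many fast stop-to-20-mph launches suggest an engine."""
    return 1.2 if f["fast_launches"] > 15 else -0.4


def tree_mean_speed(f):
    """Toy tree: sustained high average speed leans towards 'vehicle'."""
    return 0.8 if f["mean_speed_mph"] > 30 else -0.3


TREES = [tree_top_speed, tree_launches, tree_mean_speed]


def predict_vehicle_probability(features):
    """Sum the leaf scores of every tree, then apply the logistic function."""
    margin = sum(tree(features) for tree in TREES)
    return 1.0 / (1.0 + math.exp(-margin))


if __name__ == "__main__":
    ride = {"max_speed_mph": 28, "fast_launches": 3, "mean_speed_mph": 17}
    drive = {"max_speed_mph": 75, "fast_launches": 40, "mean_speed_mph": 45}
    print(predict_vehicle_probability(ride))
    print(predict_vehicle_probability(drive))
```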
What’s next
First, we will release another model which prevents incorrectly labeled bike rides from disrupting run leaderboards.
After that, we will release a third model which differentiates between e-bike rides and regular rides so that e-bikes no longer disrupt ride leaderboards.
And finally, we will reprocess every top 10 on Strava to ensure vehicles are removed, runs are actually runs, and rides were done without electricity. This will take us several more months, but we will keep you updated as we make progress. Our systems won’t be perfect, but we are committed to making Strava’s leaderboards as fair as we possibly can.