Detecting and defeating browser spoofing

Posted: 2017-05-03
By Lachlan Kang

Today we’ll be discussing how Browserprint uses machine learning and your fingerprint to guess which browser family your browser belongs to and which operating system you’re using. The motivation for guessing these properties is to see if we can defeat fingerprint spoofing, particularly user-agent string spoofing, as this is the simplest and most common form of spoofing. Because of this, we ignored the user-agent string when guessing browser families and operating systems, except where otherwise specified.

We find that our method of browser guessing is much more accurate than random guessing. In fact, we’ll show that we can detect the true browser and operating system of a browser that is spoofing them around 76% of the time, and that we can guess the operating system and browser family of browsers in general approximately 90% of the time, all with a final training set of fewer than 1,000 fingerprints (imagine what could be done with 10,000).

The chance of a random guess being correct in our dataset is around 43% when guessing the browser family and 42% when guessing the operating system, so these are the numbers to beat. These numbers are so high because some browsers and operating systems appear far more often than others in our dataset. We calculated these random guess accuracies by simulating random guessing 10,000,000 times and keeping a tally of how often the guesses were correct.

To simulate random guessing we essentially did two simulated spins of a roulette wheel, where each slice of the wheel represented a browser family or operating system and the slice’s size was based on how many times it occurred in the dataset; if the two spins landed on the same slice then the guess was correct, otherwise it was wrong. To guard against anomalous results we repeated the simulation several times with different random number generator seeds and checked that the results were consistent. Please note that this is just the chance of a random guess being correct within our dataset; other datasets may have different distributions of browsers and operating systems, leading to different random guess accuracies.
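
The two-spin roulette wheel procedure described above can be sketched roughly as follows. This is a minimal illustration, not the code Browserprint actually used: the function names, the example label counts, and the assumption that each slice is exactly proportional to how often a label occurs are all ours. Under that proportionality assumption the simulation should converge on the sum of the squared label frequencies, so the sketch also computes that value as a cross-check.

```python
import random
from collections import Counter


def simulate_random_guess_accuracy(labels, trials=100_000, seed=0):
    """Estimate how often two weighted 'roulette spins' land on the same slice.

    `labels` holds one entry per fingerprint (e.g. its browser family).
    Slice sizes are assumed to be proportional to label counts.
    """
    rng = random.Random(seed)  # seeded so that a run can be repeated
    counts = Counter(labels)
    categories = list(counts)
    weights = [counts[c] for c in categories]

    # First spin picks the "true" label, second spin picks the guess.
    truths = rng.choices(categories, weights=weights, k=trials)
    guesses = rng.choices(categories, weights=weights, k=trials)

    hits = sum(t == g for t, g in zip(truths, guesses))
    return hits / trials


def expected_random_guess_accuracy(labels):
    """Closed-form value the simulation approximates: sum of squared frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sum((n / total) ** 2 for n in counts.values())


if __name__ == "__main__":
    # Hypothetical label counts, NOT the real Browserprint dataset.
    labels = ["Chrome"] * 60 + ["Firefox"] * 30 + ["Safari"] * 10

    print("closed form:", expected_random_guess_accuracy(labels))
    # Repeat with several seeds to check that the estimate is stable.
    for seed in (1, 2, 3):
        print("seed", seed, simulate_random_guess_accuracy(labels, seed=seed))
```

For the made-up counts above the closed-form value is 0.6² + 0.3² + 0.1² = 0.46, which shows how a skewed distribution pushes random guess accuracy well above what a uniform spread over the same number of labels would give.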

The post then went on to explain the methodology, results, and conclusions… We might publish the rest later if/when we make contact with the author.