There can be a dark side to tracking Web visitors, especially if user privacy isn’t properly respected. But the ability to correctly identify visitors can help uncover potential sources of fraud and improve cybersecurity. As a data scientist, I like the idea that machine learning may be able to help.
When users visit a site, their Web browser shares information such as installed system fonts, screen resolution and browser version. These bits of information are all partially identifying. With enough of these data bits, we can uniquely identify the visitor (Figure 1).
The amount of identifying information is measured as entropy, in bits. If we want to identify a single user in a population of size X, then we need user data with at least log2(X) bits of entropy. For example, assuming there are 8 billion people in the world, we need data with about 33 bits of entropy (log2 of 8 billion is roughly 32.9) to uniquely identify a single person. The data shared by the Web browser may have enough entropy to serve as a unique fingerprint.
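To make the arithmetic concrete, here is a minimal sketch in Python. The function names and the sample values are illustrative assumptions, not part of any fingerprinting library. The second function estimates the entropy of a single browser attribute from a sample of observed values.

```python
import math
from collections import Counter


def bits_to_identify(population: int) -> int:
    """Whole bits of entropy needed to single out one member of a population."""
    return math.ceil(math.log2(population))


def shannon_entropy(observed_values) -> float:
    """Estimate the entropy (in bits) of one attribute from a sample of its values."""
    counts = Counter(observed_values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    print(bits_to_identify(8_000_000_000))
    # Hypothetical sample: four equally common screen resolutions.
    print(shannon_entropy(["1280x800", "1920x1080", "1366x768", "1024x768"]))
```

Adding up the entropy of several attributes only gives the true total if the attributes are independent. In practice they are correlated (a given browser version tends to come with certain fonts, for example), so the sum is an upper bound.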
Figure 1: Panopticlick tests the uniqueness of your browser based on the information it shares.
But there’s a problem. Browser settings on devices change often, and the fingerprints change with them. The Panopticlick study found that over 37% of return users had a fingerprint that changed. The problem gets larger as time goes by (Figure 2). You can expect that after 15 days, all user fingerprints will be different. (It’s important to note, however, that this particular study encouraged users to change their browser settings, so the results probably overstate how often fingerprints change on average.)
Even though fingerprints can change quickly, it is often possible to guess when a fingerprint is an updated version of a previous one. Using a simple heuristic, the Panopticlick study reported that 99.1% of its guesses were correct, with a false positive rate of 0.86%.
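To give a feel for what such a heuristic might look like, here is a minimal sketch. It is not the study's actual algorithm. It assumes a fingerprint is a dictionary of attribute strings, and it links a new fingerprint to a known one only when exactly one attribute changed, the old and new values are similar, and exactly one known fingerprint qualifies. The function names and the 0.85 similarity threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def changed_features(old: dict, new: dict) -> list:
    """Names of the attributes whose values differ between two fingerprints."""
    return [k for k in sorted(set(old) | set(new)) if old.get(k) != new.get(k)]


def guess_previous(new_fp: dict, known: dict, min_similarity: float = 0.85):
    """Return the user ID whose stored fingerprint most plausibly evolved into new_fp.

    Returns None when no candidate, or more than one candidate, qualifies.
    Exact matches are assumed to be handled by a direct lookup elsewhere.
    """
    candidates = []
    for user_id, old_fp in known.items():
        diff = changed_features(old_fp, new_fp)
        if len(diff) != 1:
            continue  # only consider fingerprints that differ in exactly one attribute
        key = diff[0]
        old_value = str(old_fp.get(key, ""))
        new_value = str(new_fp.get(key, ""))
        if SequenceMatcher(None, old_value, new_value).ratio() >= min_similarity:
            candidates.append(user_id)
    # Refuse to guess when the match is ambiguous.
    return candidates[0] if len(candidates) == 1 else None
```

Refusing to guess in ambiguous cases trades coverage for accuracy, which is one way to keep the false positive rate low.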
Figure 2: The results from Panopticlick show portions of fingerprints that change over given periods of time.
It may be possible to build a system that uses unsupervised machine learning to recognize changes in fingerprints and identify users over time. Clustering has been applied to a related identification problem before: as early as 2006, Bernaille et al. proposed using simple K-Means clustering to classify different types of TCP-based applications from the first few packets of an Internet traffic flow.
Imagine a system built around a high-speed communication backplane (Figure 3). Connected to the backplane are a fingerprint datastore and a user directory. The website stores fingerprints in the fingerprint datastore and requests user IDs from the user directory. Working in parallel, an unsupervised machine learning algorithm groups the fingerprints into a large number of very small disjoint clusters. Each cluster represents a persistent unique user. The results are written back to the user directory, which maps users to their changing fingerprints.
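As a rough sketch of the clustering step, the code below groups fingerprints into disjoint clusters and returns a toy user directory. Because the number of users is not known in advance, this sketch swaps K-Means for a simple threshold-based single-linkage grouping (implemented with union-find). The distance metric (fraction of attributes that differ), the 0.34 threshold and the `user-N` naming are all assumptions made for illustration.

```python
from itertools import combinations


def distance(a: dict, b: dict) -> float:
    """Fraction of attributes on which two fingerprints disagree (0.0 to 1.0)."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    return sum(a.get(k) != b.get(k) for k in keys) / len(keys)


def cluster_fingerprints(fingerprints: list, max_distance: float = 0.34) -> dict:
    """Group fingerprints into disjoint clusters; return {user_id: [fingerprint indexes]}."""
    parent = list(range(len(fingerprints)))

    def find(i: int) -> int:
        # Follow parent links to the cluster root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the clusters of any two fingerprints that are close enough.
    for i, j in combinations(range(len(fingerprints)), 2):
        if distance(fingerprints[i], fingerprints[j]) <= max_distance:
            parent[find(j)] = find(i)

    clusters = {}
    for idx in range(len(fingerprints)):
        clusters.setdefault(find(idx), []).append(idx)
    # The "user directory": one hypothetical user ID per cluster.
    return {f"user-{n}": members for n, members in enumerate(clusters.values())}
```

Comparing every pair of fingerprints is quadratic in the number of fingerprints, so a real system would need indexing or blocking to scale. The sketch is only meant to show the shape of the output: each user maps to the set of fingerprints believed to be theirs.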
Figure 3: This illustrates the basic idea for a system using machine learning to recognize changing fingerprints.
To bring this idea to life, we would have to experiment to find the right set of fingerprint features, the optimal cluster distance metric and the right number of clusters to generate. We can expect that different devices will behave differently. For example, iPhone fingerprints may change far more often than Internet Explorer fingerprints. It would likely be necessary to partition fingerprint data based on device and to optimize the algorithm for each partition.
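One way to picture the partitioning is sketched below. It assumes each fingerprint is a dictionary with a `user_agent` string, and it uses crude substring checks to assign a device family. The family names and the per-partition distance thresholds are hypothetical placeholders; finding good values is exactly the experimentation described above.

```python
from collections import defaultdict

# Hypothetical tuning values: a looser threshold where fingerprints change more often.
PARTITION_THRESHOLDS = {"iphone": 0.5, "internet_explorer": 0.25}
DEFAULT_THRESHOLD = 0.34


def device_family(fingerprint: dict) -> str:
    """Assign a coarse device family from the user-agent string (illustrative rules only)."""
    user_agent = fingerprint.get("user_agent", "").lower()
    if "iphone" in user_agent:
        return "iphone"
    if "msie" in user_agent or "trident" in user_agent:
        return "internet_explorer"
    return "other"


def partition_by_device(fingerprints: list) -> dict:
    """Split fingerprints into per-family lists so each can be clustered separately."""
    partitions = defaultdict(list)
    for fingerprint in fingerprints:
        partitions[device_family(fingerprint)].append(fingerprint)
    return dict(partitions)


def threshold_for(family: str) -> float:
    """Look up the clustering threshold to use for one partition."""
    return PARTITION_THRESHOLDS.get(family, DEFAULT_THRESHOLD)
```

Each partition can then be handed to its own clustering run with its own settings.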
But the coolest part of this idea is that it can start small. We could start with a simple hash function that does nothing more than create a one-to-one mapping from fingerprint to user, then add more complexity in subsequent iterations.
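That first iteration could be as small as the sketch below, which assumes a fingerprint is a dictionary of JSON-serializable attributes. The function name and the choice to truncate the SHA-256 digest to 16 hex characters are illustrative assumptions.

```python
import hashlib
import json


def fingerprint_to_user_id(fingerprint: dict) -> str:
    """Map a fingerprint to a stable user ID by hashing a canonical form of it."""
    # Sort keys so the same attributes always serialize to the same string.
    canonical = json.dumps(fingerprint, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

With this version, any change to any attribute produces a new user ID. That is the limitation later iterations would address by recognizing when a changed fingerprint belongs to a known user.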
I’m using this blog post as a sounding board. I’m hoping to get feedback, recommendations and maybe even collaborators for a future effort. If you are interested, please leave a comment or find me on Twitter and let me know what you think.
Jerry Overton is head of advanced analytics research in CSC’s ResearchNetwork and founder of CSC’s FutureTense competency, which includes the Predictive Modeling Research Group, Advanced Analytics Lab and Predictive Modeling School. Connect with him on Twitter.