The Digital Fingerprint

There can be a dark side to tracking Web visitors, especially if user privacy isn’t properly respected. But the ability to correctly identify visitors can help uncover potential sources of fraud and improve cybersecurity. As a data scientist, I like the idea that machine learning may be able to help.

When users visit a site, their Web browser shares information such as installed system fonts, screen resolution and browser version. These bits of information are all partially identifying. With enough of these data bits, we can uniquely identify the visitor (Figure 1).

The amount of identifying information is measured in bits of entropy. If we want to identify a single user in a population of size X, then we need user data with at least log2(X) bits of entropy. For example, assuming there are 8 billion people in the world, we need data with 33 bits of entropy (log2 of 8 billion is about 32.9) to uniquely identify a single person. The data shared by the Web browser may have enough entropy to serve as a unique fingerprint.
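To make the arithmetic concrete, here is a small sketch in Python. The function names and the attribute shares are my own illustration, not measured values, and adding the bits together assumes the attributes are independent of one another, which real browser attributes usually are not.

```python
import math


def bits_to_identify(population: int) -> float:
    """Bits of entropy needed to single out one member of a population."""
    return math.log2(population)


def bits_revealed(share_of_visitors: float) -> float:
    """Bits revealed by an attribute value seen in this share of visitors."""
    return -math.log2(share_of_visitors)


needed = bits_to_identify(8_000_000_000)
print(f"needed: {needed:.1f} bits")  # about 32.9, so 33 whole bits

# Hypothetical shares, invented for illustration only.
observed = {
    "screen_resolution": 1 / 16,   # value shared by 1 in 16 visitors
    "browser_version": 1 / 64,     # value shared by 1 in 64 visitors
    "system_fonts": 1 / 1024,      # value shared by 1 in 1024 visitors
}

# Summing is only valid if the attributes are independent (an assumption).
total = sum(bits_revealed(share) for share in observed.values())
print(f"observed: {total:.0f} bits")
```

With these made-up numbers the three attributes give 20 bits, which would narrow 8 billion people down to a few thousand but would not yet single out one person.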

 


Figure 1: Panopticlick tests the uniqueness of your browser based on the information it shares.

But there’s a problem. The browser settings on devices often change and so do the fingerprints. A study by Panopticlick found that over 37% of return users had a fingerprint that changed. The problem gets larger as time goes by (Figure 2). You can expect that after 15 days, all user fingerprints will be different. (It’s important to note, however, that this particular study encouraged users to change their browser settings; so the results probably overstate how often fingerprints change on average.)

Even though fingerprints can change quickly, a simple algorithm can often guess when a fingerprint is an updated version of a previous one. Using a simple heuristic, Panopticlick was able to guess correctly 99.1% of the time with a false positive rate of 0.86%.
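Here is a rough sketch of what such a heuristic could look like. To be clear, this is not Panopticlick's actual algorithm; it is a simplified rule I made up for illustration. It assumes a fingerprint is a dictionary of string attributes, and it links a new fingerprint to an old one only when exactly one attribute changed, the old and new values of that attribute are similar, and no other stored fingerprint qualifies. The field names and the 0.85 threshold are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional

Fingerprint = Dict[str, str]


def changed_fields(a: Fingerprint, b: Fingerprint) -> List[str]:
    """Names of the attributes whose values differ between two fingerprints."""
    return [key for key in a.keys() | b.keys() if a.get(key) != b.get(key)]


def guess_previous(new: Fingerprint, known: List[Fingerprint],
                   min_similarity: float = 0.85) -> Optional[int]:
    """Return the index of the likely earlier fingerprint, or None if unsure."""
    candidates = []
    for index, old in enumerate(known):
        diff = changed_fields(new, old)
        if len(diff) != 1:
            continue  # identical, or too many attributes changed at once
        field = diff[0]
        similarity = SequenceMatcher(
            None, old.get(field, ""), new.get(field, "")).ratio()
        if similarity >= min_similarity:
            candidates.append(index)
    # Refuse to guess when the match is ambiguous.
    return candidates[0] if len(candidates) == 1 else None


known = [
    {"user_agent": "Mozilla/5.0 Firefox/41.0", "screen": "1920x1080",
     "fonts": "Arial,Verdana"},
    {"user_agent": "Mozilla/5.0 Chrome/46.0", "screen": "1366x768",
     "fonts": "Arial"},
]
upgraded = dict(known[0], user_agent="Mozilla/5.0 Firefox/42.0")
print(guess_previous(upgraded, known))
```

Returning None on ambiguity trades some recall for a lower false positive rate, which seems like the right default when a wrong match means confusing two different visitors.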


Figure 2: The results from Panopticlick show portions of fingerprints that change over given periods of time.

Some believe it’s possible to build a system that uses unsupervised machine learning to recognize changes in fingerprints and identify users over time. As early as 2006, Bernaille et al. proposed a technique for using simple K-Means clustering to classify different types of TCP-based applications using the first few packets of Internet traffic flow.

Imagine a system built around a high-speed communication backplane (Figure 3). Connected to the backplane are a fingerprint datastore and a user directory. The website stores fingerprints in the fingerprint database and requests user IDs from the user directory. Working in parallel, there is an unsupervised machine learning algorithm that groups the fingerprints into a large number of very small disjoint clusters. Each cluster represents a persistent unique user. The results are updated in the user directory, which maps users to their changing fingerprints.


Figure 3: This illustrates the basic idea for a system using machine learning to recognize changing fingerprints.

To bring this idea to life, we would have to experiment to find the right set of fingerprint features, the optimal cluster distance metric and the right number of clusters to generate. We can expect that different devices will behave differently. For example, iPhone fingerprints may change far more often than Internet Explorer fingerprints. It would likely be necessary to partition fingerprint data based on device and to optimize the algorithm for each partition.
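As a starting point for that experimentation, here is a minimal sketch of grouping fingerprints into many small disjoint clusters. I've swapped in a distance-threshold (single-linkage) grouping instead of K-Means so that the number of clusters does not have to be fixed in advance; the price is a threshold that needs tuning instead. The distance metric (the share of attributes that differ), the 0.34 threshold and the field names are all assumptions for illustration, not a tested design.

```python
from typing import Dict, List

Fingerprint = Dict[str, str]


def distance(a: Fingerprint, b: Fingerprint) -> float:
    """Share of attributes that differ: 0.0 is identical, 1.0 is all different."""
    keys = a.keys() | b.keys()
    if not keys:
        return 0.0
    return sum(a.get(key) != b.get(key) for key in keys) / len(keys)


def cluster(fingerprints: List[Fingerprint],
            max_distance: float = 0.34) -> List[int]:
    """Label each fingerprint with a cluster ID using a union-find structure."""
    parent = list(range(len(fingerprints)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    # Compare every pair; this is O(n^2) and only suitable for small samples.
    for i in range(len(fingerprints)):
        for j in range(i + 1, len(fingerprints)):
            if distance(fingerprints[i], fingerprints[j]) <= max_distance:
                parent[find(j)] = find(i)

    labels: Dict[int, int] = {}
    return [labels.setdefault(find(i), len(labels))
            for i in range(len(fingerprints))]


samples = [
    {"user_agent": "Firefox/41.0", "screen": "1920x1080", "fonts": "Arial,Verdana"},
    {"user_agent": "Firefox/42.0", "screen": "1920x1080", "fonts": "Arial,Verdana"},
    {"user_agent": "Chrome/46.0", "screen": "1366x768", "fonts": "Arial"},
]
print(cluster(samples))
```

Each cluster ID would play the role of a persistent user in the user directory. Partitioning by device, as suggested above, would amount to running this separately on each partition with its own threshold.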

But the coolest part of this idea is that it can start small. We could start with a simple hash function that does nothing more than create a one-to-one mapping from fingerprint to user, then add more complexity in subsequent iterations.
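That first iteration could be as small as the sketch below. It assumes a fingerprint arrives as a dictionary of attributes; the function name, the field names and the choice of SHA-256 truncated to 16 hex characters are mine, not part of any existing system.

```python
import hashlib
import json
from typing import Dict


def user_id_for(fingerprint: Dict[str, str]) -> str:
    """Map a fingerprint to a stable user ID (one fingerprint, one user)."""
    # Sort the keys so the same attributes always serialize the same way.
    canonical = json.dumps(fingerprint, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


first_visit = {"user_agent": "Firefox/41.0", "screen": "1920x1080"}
same_browser = {"screen": "1920x1080", "user_agent": "Firefox/41.0"}
after_upgrade = {"user_agent": "Firefox/42.0", "screen": "1920x1080"}

print(user_id_for(first_visit) == user_id_for(same_browser))   # same user ID
print(user_id_for(first_visit) == user_id_for(after_upgrade))  # a new "user"
```

The second comparison shows the limitation that later iterations would need to address: any change to the fingerprint, even a browser upgrade, produces a brand new user ID.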

I’m using this blog post as a sounding board. I’m hoping to get feedback, recommendations and maybe even collaborators for a future effort. If you are interested, please leave a comment or find me on Twitter and let me know what you think.



Jerry Overton is head of advanced analytics research in CSC’s ResearchNetwork and founder of CSC’s FutureTense competency, which includes the Predictive Modeling Research Group, Advanced Analytics Lab and Predictive Modeling School. Connect with him on Twitter.

Comments

  1. Logan Wilt says:

    Jerry, this is a really neat idea. Now, there may be enough data sent to identify users, but what you described sounds more like a way to identify unique devices/browsers. I use my phone, tablet, and three browsers on my computer to access websites and all would have different fingerprints.


    • Good point, Logan. And, actually, I’m not sure how to solve the problem of correlating across devices. I imagine the solution would require some form of recognizing behavior common to different fingerprints and using that behavior to infer a connection. Maybe it’s just a recursive application of the same concept that I described in the blog. Maybe in addition to clustering based on device info, we also cluster based on site visit behavior. Perhaps, in that case, the root cluster becomes a guess at users which spans different devices. Not sure. I welcome ideas.


  2. Kyle Zellman says:

    This is a cool idea and I definitely want to discuss it with you next time we talk. However, I’m curious to understand what you think the value in this is? How do we benefit? How do potential clients benefit?


    • Security is what I had in mind. I can imagine something like this to be useful to anyone looking for better fraud detection. So, the inventor introduces a valuable innovation and consumers get better security.


  3. Chris Marin says:

    Folks have been looking at this area pretty closely for a while, particularly after the EU cookie legislation started in earnest. There have been quite a few advancements since then, particularly with active fingerprinting techniques like using HTML5 canvas:
    http://venturebeat.com/2014/07/30/canvas-fingerprinting-is-tracking-you-and-you-dont-even-know-what-it-is/

    Or clock skew:
    http://deter-project.org/sites/default/files/files/hussain%20alefiya_%20sharma%20swati_saran%20huzur_experience%20with%20heterogenous%20clock-skew%20based%20device%20fingerprinting_acm_laser%20'12_arlington%20virginia_july%2018-19%202012.pdf
    https://wiki.mozilla.org/Fingerprinting

    The most interesting one though is the behavioral approach where you look at typing cadence:
    http://arstechnica.com/security/2015/07/how-the-way-you-type-can-shatter-anonymity-even-on-tor/

    Here are some additional articles in case they are of interest:
    https://zyan.scripts.mit.edu/presentations/toorcon2015.pdf
    http://valve.github.io/blog/2013/07/14/anonymous-browser-fingerprinting/
    http://www.cnet.com/news/your-web-browsing-history-is-totally-unique-like-fingerprints/

    Of course it is imperative to keep in mind that even fingerprinting is not immune from privacy legislation:
    https://www.privacy-europe.com/blog/eu-device-fingerprinting-require-consent/


  4. Hi Jerry, suggest you take a look at http://augur.io/#landingPage; they appear to have done some of the heavy lifting in this area.


  5. Hi Jerry,
    These types of techniques, tools and analyses are already commonplace for counter fraud in many industries, particularly retail banking where they are ubiquitous and include a very high degree of sophistication. There are many vendors in the market with highly mature solutions that focus much more on the factors used for fingerprinting in the first instance to minimise the drift issues you focus on as well as “solving” the multiple device problem. There are already industry shared services that provision the look up and use of hundreds of millions of linked devices for fraud purposes. Certainly a very fertile area for exploration, and I would welcome any advancements that your work might bring!


  6. Hi Jerry, a very interesting approach. I wonder how it would behave in relation to VDI clients, e.g. Citrix. I’m working for a company that installs a lot of Citrix environments, and each client device is generated from the same “golden image”. Would it be possible to find uniqueness there?


  7. Cool Jerry… what if there were some little nugget of javascript that pulled the unique “keystroke biometrics”, or maybe some black hatters out there can talk about how a professional fraudster / black hatter prepares a system (e.g. some best practices)… maybe just by locating the anti-pattern it’s enough to drive the risk score up… this person shares the same pattern as other fraudsters, so the risk that they are committing fraud goes up. Then a smart web session can begin to prosecute a bit more intently?

    Just a thought!

    /dan

