The truth behind location data anonymity

In the world of location data privacy, you may think that removing your personally identifiable information secures your anonymity. Unfortunately, you may still be discovered – and here’s how.

Your car, your cellular carrier, and the apps on your smartphone are the things most likely to be tracking your location. Understandably, hearing this might make you nervous. Before we explore this any further, please take comfort in two facts. First, most companies use tracking data responsibly to create safer and smarter data-driven services. Second, most companies (HERE included) sincerely want to protect your privacy. In fact, laws enforcing privacy protection are being enacted around the world.

What we’re looking at today is a particularly sensitive aspect of data privacy: location data privacy. It’s a unique challenge due to the potential safety implications and the ways that devices track you.

As you move through the world, your devices aren’t necessarily limited to generating single, separate points of data. Rather, they can create a linked set of data points that is more than the sum of its parts. Travelling from place to place produces a whole sequence of locations and timestamps that come together to resemble a path on a map. That whole sequence, which we call a trajectory, can be particularly revealing, and it is what can make this category of privacy more challenging than other categories.

Have you ever heard of “being in the right spot at the right time”?

Of course you have. It means something fortunate happened to someone in a specific place at a specific time. To a statistical researcher or a data scientist, it means a low-probability spatio-temporal event occurred. What is key to understand here is that we’re combining the probability of an event with a very specific space and a very specific moment.

To make this easier to picture, think of all the people in the world at this very moment who are wearing glasses. The probability that you are wearing glasses right now is reasonably high – so separating you out from everyone else would be a big challenge. Now, add location. The probability that someone other than you is wearing glasses in the space where you are in this very moment is extremely low. Introducing location data to the equation makes it easier to discover you.
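To put rough numbers on that intuition, here is a minimal sketch using an entirely made-up population. The Person class, the grid cells, and the one-person-per-cell layout are assumptions chosen for illustration, not real data. It counts how many people remain consistent with what an observer knows, first with the attribute alone and then with the attribute plus a location.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str
    wears_glasses: bool
    cell: tuple  # hypothetical (row, column) grid cell the person is standing in


def candidates(people, wears_glasses, cell=None):
    """Return everyone consistent with what the observer knows."""
    return [
        p for p in people
        if p.wears_glasses == wears_glasses and (cell is None or p.cell == cell)
    ]


# Synthetic population (an assumption for illustration): 10,000 people,
# one per cell of a 100 x 100 grid, and every other person wears glasses.
people = [
    Person(f"person-{i}", i % 2 == 0, (i // 100, i % 100))
    for i in range(10_000)
]

target = people[4242]

# Knowing only "wears glasses" leaves a large crowd to hide in.
by_attribute = candidates(people, target.wears_glasses)

# Adding "was in this cell right now" shrinks the crowd dramatically.
by_attribute_and_place = candidates(people, target.wears_glasses, target.cell)

print(len(by_attribute))            # 5000
print(len(by_attribute_and_place))  # 1
```

The cells here are deliberately small enough to hold one person each. With coarser cells the second count would be larger, which is one reason the precision of published location data matters.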

This is why location data privacy is so tricky. A company that tracks you can completely remove your personally identifiable information from the data points and trajectories it later makes public. Yet anyone (especially parties other than the company publishing the data) can potentially combine those published trajectories with their own insights or other publicly available data, and use that combination to identify you or the individual in question. In fact, MIT researchers have long known that it’s possible to identify you using only four location datapoints.
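The effect of knowing just a few points can be sketched with a toy example. The traces, place names, and the matching_traces helper below are hypothetical and were constructed to make the point visible. This is not the MIT researchers' method or data. Each trace is a set of (place, hour) points with no name attached, and the function returns every trace consistent with the points an outsider happens to know.

```python
def matching_traces(traces, known_points):
    """Return the ids of all traces that contain every known (place, hour) point."""
    known = set(known_points)
    return sorted(trace_id for trace_id, points in traces.items() if known <= points)


# Made-up "anonymized" traces: no names, only pseudonymous ids.
traces = {
    "trace-1": {("home-1", 8), ("station", 9), ("office-A", 10), ("gym", 18)},
    "trace-2": {("home-2", 8), ("station", 9), ("office-A", 10), ("cinema", 20)},
    "trace-3": {("home-3", 8), ("station", 9), ("office-B", 10), ("gym", 18)},
    "trace-4": {("home-4", 7), ("station", 9), ("office-A", 10), ("gym", 18)},
}

# Points an outsider might learn about one person, for example from
# social media posts or from simply seeing them around.
known = [("station", 9), ("office-A", 10), ("gym", 18), ("home-1", 8)]

# Each extra known point shrinks the set of matching traces.
for k in range(1, len(known) + 1):
    print(k, matching_traces(traces, known[:k]))
```

In this toy data, one known point matches all four traces and the fourth point leaves only trace-1. How quickly real traces become unique depends on the dataset and on the spatial and temporal resolution of the points.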

This is not theoretical — it happens a lot

In a well-documented case study, the New York City Taxi and Limousine Commission released a full dataset covering every taxi ride in the city during 2013. The information they released included the taxis’ locations, pickup and drop-off times, fares, and tips to the drivers. None of the drivers’ names, license numbers, or other personally identifiable information was published. Each car’s information was replaced with a numeric identifier, or a hash.

This is where the external, publicly available related information plays its part.

A photojournalist noticed they had a stock of pictures of celebrities getting out of cabs in front of buildings. The time and place of those photos were documented, and the numbers on the taxis were visible in many of the pictures.

All the photographer needed to do was combine the information from any single photo with the data point which was collected at that same time and place in the NYC Taxi database – and there it was. They discovered the full sequence of data points from that specific taxi for the entire day. That in turn revealed where the taxi picked up the celebrity before, or where they were dropped off after the photo was taken.
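The linking step described above can be sketched in a few lines. All records, pseudonyms, place names, and the five-minute tolerance below are made up for illustration and do not reflect the layout of the real taxi dataset. The idea is to find the record that matches the photo's time and place, read off its pseudonym, and then pull every other record carrying that same pseudonym.

```python
from datetime import datetime, timedelta

# Hypothetical published records: (pseudonym, timestamp, place, event).
TRIPS = [
    ("taxi-17", datetime(2013, 7, 8, 18, 40), "hotel", "pickup"),
    ("taxi-17", datetime(2013, 7, 8, 19, 2), "theatre", "dropoff"),
    ("taxi-17", datetime(2013, 7, 8, 23, 15), "theatre", "pickup"),
    ("taxi-17", datetime(2013, 7, 8, 23, 41), "restaurant", "dropoff"),
    ("taxi-52", datetime(2013, 7, 8, 19, 30), "theatre", "dropoff"),
    ("taxi-52", datetime(2013, 7, 8, 20, 10), "airport", "dropoff"),
]


def pseudonyms_seen(trips, place, when, tolerance=timedelta(minutes=5)):
    """Pseudonyms with a record at this place within the time tolerance."""
    return {
        pid for pid, ts, where, _ in trips
        if where == place and abs(ts - when) <= tolerance
    }


def full_day(trips, pseudonym):
    """Every record published under one pseudonym, in time order."""
    return sorted(
        (ts, where, event) for pid, ts, where, event in trips if pid == pseudonym
    )


# External knowledge: a photo taken outside the theatre at about 19:00.
matches = pseudonyms_seen(TRIPS, "theatre", datetime(2013, 7, 8, 19, 0))

# If exactly one pseudonym fits, the whole day for that taxi is exposed.
trail = full_day(TRIPS, next(iter(matches))) if len(matches) == 1 else []

for ts, where, event in trail:
    print(ts.strftime("%H:%M"), where, event)
```

Removing names did not help here because the pseudonym stays the same across records. A single outside observation is enough to tie the whole sequence to a person.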

This example was a journalist taking a dataset and applying some common sense. When data scientists and researchers took the same information, they were able to uncover the names of the taxi drivers, their home addresses, their income, and their detailed driving patterns. What’s more, they accomplished this using only publicly available information.

There are a variety of other privacy breach examples that demonstrate the revealing potential of linked data. In one, a student was able to locate military bases in the Middle East through anonymized data from a fitness application. Elsewhere, data researchers used a combination of anonymized ratings data from a movie streaming service and the Internet Movie Database (IMDb) to identify specific users and the users’ entire viewing history.

De-identification doesn’t guarantee anonymity

What we’ve established is that removing your personal information from location data does not automatically make you anonymous. Yet, there is a more important takeaway: companies like ours, who offer an open data platform, must work smarter to defend user privacy. The research that builds better businesses requires trust from the people providing the data behind that research. We recognize that maintaining that trust is a deeply important job.

While we must protect user privacy as a priority, it is also our business to be certain we are preserving enough value in our data for the creation of better services. How do we strike that balance?

In Part II of this article, we will look at how to realize a win-win situation where privacy is protected while maximizing data value for our services.