Hollywood heartthrob Bradley Cooper is a bad tipper. That was the conclusion drawn by media – though denied by his PR rep – when data about 173 million New York taxi trips became public.
But I drew a different and more disturbing conclusion: how easy it is to get privacy ‘wrong’ even when a government official is trying to get transparency ‘right’. Here’s what happened.
In March 2014, the New York City Taxi & Limousine Commission (known by the rather sweet acronym TLC) released, under FOI, data recorded by taxis’ GPS systems. The dataset covered more than 173 million individual taxi trips taken in New York City during 2013. The FOI applicant used the data to make a cool visualisation of a day in the life of a NYC taxi, and published the data online for others to use.
Top marks for government transparency, useful ‘open data’ for urban transport and planning research … but not, as it turns out, great for Bradley Cooper.
Each trip record included the date, location and time of the pickup and drop-off, the fare paid, and any recorded tip. It also included a unique code for each taxi and taxi driver.
In theory the identity of each taxi and taxi driver had been ‘anonymised’ by the use of ‘hashing’ – a one-way function (often loosely described as one-way encryption) which replaced each driver licence number and taxi medallion number with an alphanumeric code that can’t be directly reversed to recover the original information.
However, as a computer scientist who examined the published dataset pointed out, hashing is a poor defence when you know what the original input might have looked like. If you know the format of taxi medallion numbers (and that’s not difficult – they are printed on the side of every taxi), you can hash all possible numbers and compare the results against the dataset. It took a software developer less than an hour to re-identify the vehicle and driver for all 173 million trips. Anyone can now calculate an individual driver’s income for the year, the number of miles they drove, or where they were at any given time. So much for their privacy.
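The attack is easy to sketch in a few lines. The sketch below assumes MD5 (the hash reportedly used on the TLC dataset) and one common medallion pattern – digit, letter, digit, digit – giving only 26,000 candidates, so a complete lookup table can be built in well under a second:

```python
import hashlib
import string
from itertools import product

def md5_hex(s: str) -> str:
    return hashlib.md5(s.encode()).hexdigest()

def candidate_medallions():
    # One common NYC medallion pattern: digit, letter, digit, digit
    # (e.g. "7J23"). Only 10 * 26 * 10 * 10 = 26,000 candidates.
    digits, letters = string.digits, string.ascii_uppercase
    for d1, l, d2, d3 in product(digits, letters, digits, digits):
        yield f"{d1}{l}{d2}{d3}"

# Precompute a hash -> plaintext lookup table for every candidate.
table = {md5_hex(m): m for m in candidate_medallions()}

# Any 'anonymised' hash in the dataset can now be reversed instantly:
anonymised = md5_hex("7J23")   # how the dataset obscured the medallion
print(table[anonymised])       # recovers "7J23"
```

The same approach works for driver licence numbers: because the input space is small and its format is public, the one-way property of the hash buys nothing.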
But what’s this got to do with Bradley Cooper, you ask?
Geolocation data exposes behaviour
While the computer science community started debating the limits of hashing and how TLC should have ‘properly’ anonymised their dataset before releasing it under FOI, an astute postgrad student found that even if TLC had removed all the details about the driver and the taxi, the geolocation data alone could potentially identify taxi passengers.
To demonstrate this, Anthony Tockar googled images of celebrities getting in or out of taxis in New York during 2013. Using other public data like celebrity gossip blogs, he was able to determine where and when various celebrities got into taxis. Using the TLC dataset, Anthony could then identify exactly where Bradley Cooper went, and how much he paid. (Mind you, cash tips are not recorded, hence the debate about whether or not he is a bad tipper.)
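The linkage step needs no cryptography at all: armed with a rough pickup time and location from a photo or gossip blog, you simply filter the dataset. A minimal sketch of this idea, using hypothetical field names and a toy trip list rather than the real TLC schema:

```python
import math
from datetime import datetime

def haversine_m(lat1, lon1, lat2, lon2):
    # Approximate great-circle distance in metres between two points.
    r = 6371000
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def match_trips(trips, when, lat, lon, window_min=5, radius_m=100):
    # Keep trips whose pickup falls within the time window and radius
    # of the sighting - usually few enough to identify one passenger.
    return [
        t for t in trips
        if abs((t["pickup_time"] - when).total_seconds()) <= window_min * 60
        and haversine_m(t["pickup_lat"], t["pickup_lon"], lat, lon) <= radius_m
    ]

# Toy data: one trip beginning near a known Manhattan address.
trips = [{"pickup_time": datetime(2013, 7, 8, 23, 42),
          "pickup_lat": 40.7614, "pickup_lon": -73.9776,
          "fare": 12.5}]

# A sighting two minutes earlier, a few metres away, matches it.
hits = match_trips(trips, datetime(2013, 7, 8, 23, 40), 40.7613, -73.9775)
print(len(hits), hits[0]["fare"])
```

With the match made, everything else in the trip record – drop-off point, fare, tip – attaches to the named individual.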
Anthony also developed an interactive map, showing the drop-off address for each taxi trip which had begun at a notorious strip club. I imagine the same could be done to easily identify the start or end-point for each taxi trip to or from an abortion clinic, a drug counselling service, or the home address of an investigative journalist, suspected whistle-blower or partner suspected of cheating.
As Anthony notes, “with only a small amount of auxiliary knowledge, using this dataset an attacker could identify where an individual went, how much they paid, weekly habits, etc … ‘I was working late at the office’ no longer cuts it”.
Open data or open sesame?
The publication of the NYC taxi dataset illustrates a particular challenge for privacy professionals: how easily an individual’s identity, pattern of behaviour, physical movements and other traits can be extrapolated from a supposedly ‘anonymous’ set of data, published with good intentions in the name of ‘open data’, public transparency or research.
Other recent examples have included the re-identification of ‘anonymous’ DNA donors to the ‘Thousand Genomes Project’ research database, and the re-identification of bike riders using London’s public bicycle hire scheme. As the blogger who turned the publicly available bike trips dataset into interactive maps noted, allegedly ‘anonymous’ geolocation data, even when months old, can allow all sorts of inferences to be drawn about individuals – including their identity, and their behaviour.
Closer to home, the NSW Civil & Administrative Tribunal has shown it is willing to challenge, rather than blindly accept, government agencies’ assertions about the identifiability of data. It has accepted that if a “simple internet search” links the data in question back to an individual’s identity, the published data will meet the definition of ‘personal information’, and its publication or disclosure becomes contestable under privacy law.
For privacy professionals, the goalposts are shifting. An assumption that data has been ‘de-identified’, and is thus not subject to privacy restrictions, may no longer hold true.
Privacy Officers would do well to engage with their colleagues who might publish or release datasets, such as people working in FOI, open data, corporate communications or research, to ensure they understand the risks of re-identification, and know about their disclosure obligations under privacy law.
You don’t want your ‘open data’ to become ‘open sesame’.
(April 2018 update: If you would like some privacy tools to help you assess the risks posed by a new open data project, or if you are wondering how the GDPR’s requirement to incorporate Data Protection by Design should be implemented in practice, check out our range of Compliance Kits to see what suits your needs.)
Photograph © Shutterstock