Are you embarrassed to admit that you don’t know your statistical linkage keys from your house keys? Think ‘hashing’ is something you do to potatoes, and ‘adding salt’ is something you do to hot chips? Imagine ‘rainbow tables’ have something to do with pre-schoolers’ craft-time? Assume ‘adding noise’ refers to something your teenagers do?
Then read on, my friend. You need de-identification to be demystified for you, stat.
Some people are happiest when working with words, and others are happiest when working with numbers. So coming from my word-loving arts/law background, I was the statistical outlier when I was invited to join a panel of experts at the OAIC’s #deIDworkshop in Canberra late last year. There was some serious intellectual firepower represented amongst the cryptographers (that’s something to do with maths, apparently), statisticians and data scientists. Quite a few PhDs sitting at that table. I felt distinctly non-boffin.
Information Commissioner Timothy Pilgrim opened the workshop by noting that de-identification “can be a smart and contemporary response to the privacy challenges of big data – which aims to separate the ‘personal’ from the ‘information’ within data sets”.
What followed was a robust debate amongst the panellists about how to define or measure de-identification, as well as how best to achieve it. There was much discussion about whether releasing unit record data publicly can ever be considered ‘safe’ from re-identification risk.
But what I found most compelling about our discussions at the workshop was the absence of a common language to help explain either privacy law to the we-like-numbers people, or maths and statistics to the we-prefer-words people. This is a critical failing, because privacy professionals need to understand de-identification in order to do their jobs properly. (By ‘privacy professionals’ I mean the privacy officer, the lawyer, the governance or compliance manager, the audit and risk committee: anyone who needs to understand and apply privacy law or data protection rules, and assess risk, for their organisation.)
The Productivity Commission’s recent report into data use makes the point that data breaches from poor data security are much more common than those from re-identification attacks on ‘open data’. But de-identification is not only useful as a privacy-protective tool in relation to ‘open data’; it is also useful for protecting data that shouldn’t see the light of day at all.
For the individual whose data is at stake, de-identification matters because if they cannot be identified from a dataset, then they are less likely to suffer a privacy invasion. Even if the data is on an unencrypted and not-even-password-protected smartphone that is lost at an airport, if no-one can identify from the data that Phillip Bloggs was in the dataset, then Phillip is less likely to be publicly embarrassed by the accidental disclosure of his membership of the N Sync Fan Club. (Of course privacy harms can occur even without identification, but that’s a separate topic.)
For the organisation holding personal information, de-identification matters because it is simply sensible risk management. Preventing harm to the individuals whose data you hold, and protecting the reputation of your organisation, requires privacy professionals to utilise a broad range of privacy, information security and data loss prevention controls. Having de-identification as part of your toolkit means you can not only improve compliance with privacy rules, but also better leverage the value of your data.
But privacy professionals can’t apply the law without first understanding the relative merits and limitations of different de-identification techniques. We don’t need to become technical experts – but we do need to know the right questions to ask of the technical experts, so that we can assess the legal and reputational risk of what they propose, and know which privacy controls to apply when.
We should no longer be bewildered or bamboozled by terminology like ‘SLKs’ and ‘k-anonymity’, ‘differential privacy’ and ‘hashing’.
So, in our new guide, we have taken a fictional group of high school students with fun names like Kanye Peacock and Beyonce Phoenix, and illustrated how different de-identification techniques would apply to a dataset of their exam results. Our guide runs through what aggregation means, what k-anonymity means and how to achieve it, and what pseudonymity means. We explain the circumstances in which Kanye can be re-identified from his ‘indirect identifiers’ (and thus how a recipient of the data could figure out that he flunked his Spanish exam), and how some data recipients might be able to figure out Angelina Cherry’s test scores even from ‘aggregated’ data.
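For the numerically curious, the core idea behind k-anonymity fits in a few lines of code: a dataset is k-anonymous when every combination of indirect identifiers (the boffins call them ‘quasi-identifiers’) is shared by at least k records. The mini-dataset below is invented for illustration, not taken from the guide:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records sharing
    identical values across all quasi-identifier columns. A higher k
    means each individual hides in a bigger crowd."""
    groups = Counter(
        tuple(record[col] for col in quasi_identifiers)
        for record in records
    )
    return min(groups.values())

# A made-up mini-dataset of exam results (names already removed).
students = [
    {"suburb": "Newtown", "year": 12, "gender": "F", "spanish": 43},
    {"suburb": "Newtown", "year": 12, "gender": "M", "spanish": 88},
    {"suburb": "Newtown", "year": 12, "gender": "M", "spanish": 91},
]

# The lone female record is unique on her indirect identifiers, so
# anyone who knows a Year 12 girl from Newtown can read off her score.
print(k_anonymity(students, ["suburb", "year", "gender"]))  # 1
```

Removing names alone did nothing for her: the combination of suburb, year and gender singled her out anyway.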
We explain what it looks like when you replace Beyonce’s name, date of birth and gender with a statistical linkage key, and the circumstances in which you might use a pseudonym like an SLK instead of a different de-identification technique.
(Spoiler alert: Beyonce’s SLK is HONEY121120002. Note that gender is not coded as M or F, but as 1 or 2, where male = 1 and female = 2. Who are the sexist bastards who came up with that standard, huh? I think that instead of a 2 at the end of her SLK, Beyonce should get an A for Awesome. But perhaps that’s just me.)
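For readers who like to see the machinery, here is a rough sketch of how an SLK-581-style key is assembled: the 2nd, 3rd and 5th letters of the family name, the 2nd and 3rd letters of the given name, the date of birth as ddmmyyyy, and the sex code. (A sketch only: the real standard also has padding rules for short or missing names, which are skipped here.)

```python
def slk581(given_name: str, family_name: str,
           dob_ddmmyyyy: str, sex_code: int) -> str:
    """Assemble an SLK-581-style statistical linkage key.

    Concatenates the 2nd, 3rd and 5th letters of the family name,
    the 2nd and 3rd letters of the given name, the date of birth as
    ddmmyyyy, and the sex code (1 = male, 2 = female). Assumes the
    names are long enough to skip the standard's padding rules.
    """
    fam = family_name.upper()
    giv = given_name.upper()
    name_part = fam[1] + fam[2] + fam[4] + giv[1] + giv[2]
    return f"{name_part}{dob_ddmmyyyy}{sex_code}"

print(slk581("Beyonce", "Phoenix", "12112000", 2))  # HONEY121120002
# The collision problem: a different girl born on the same day can
# produce the very same key.
print(slk581("Fey", "Phoenetics", "12112000", 2))   # HONEY121120002
```

Because only five name letters survive, quite different names can map to identical keys, which is exactly the linkage error discussed below.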
And you will see the limitations of each method. For example, if there is another girl with the same date of birth as Beyonce in the dataset whose name is Fey Phoenetics, she will have the same SLK, and so their records could be linked together erroneously. As will Teyla Chomney’s and Leyla Thorn’s. (Can you tell I had fun coming up with those names? But I digress….) The protections offered by ‘hashing’ some data elements can be undermined by attackers using ‘rainbow tables’, but can be strengthened if instead you first ‘salt’ the data.
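For the technically inclined, here is a minimal sketch of hashing with and without a salt. (The salt string below is a made-up placeholder; a real one would be a long, randomly generated secret.)

```python
import hashlib

def hash_identifier(value: str, salt: str = "") -> str:
    """Return the SHA-256 hex digest of salt + value.

    Without a salt, hashes of predictable inputs (names, ID numbers)
    can be reversed via precomputed 'rainbow tables' of input-to-hash
    pairs. Prepending a secret salt defeats those precomputed tables.
    """
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

plain = hash_identifier("Beyonce Phoenix")
salted = hash_identifier("Beyonce Phoenix", salt="keep-this-value-secret")

# Same input, completely different digest: an attacker without the
# salt cannot look the salted version up in a rainbow table.
print(plain != salted)  # True
```

The hash is deterministic, which is what makes it useful for linkage: the same name and salt always produce the same digest, so records can still be matched without revealing the name itself.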
We’ve even covered ‘differential privacy’ and the technique of ‘adding noise’ to data, which is now a hot topic thanks to Apple. (Who would have thought that solving the great privacy challenge of our time would be given a boost by the need to figure out people’s favourite emojis?)
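For the curious, ‘adding noise’ can be sketched in a few lines. This shows the simpler ‘central’ model of differential privacy, where a trusted curator perturbs each published statistic with Laplace noise; Apple’s emoji counting uses a ‘local’ variant, with noise added on each device before collection, which this sketch does not attempt.

```python
import random

def noisy_count(true_count: int, epsilon: float = 1.0) -> float:
    """Return the count plus Laplace noise of scale 1/epsilon.

    Smaller epsilon means more noise and stronger privacy: the
    published answer no longer reveals whether any one individual
    was present in the underlying data.
    """
    # The stdlib has no Laplace sampler, but the difference of two
    # independent exponential draws is Laplace-distributed.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# Each query gets a slightly different answer, yet large-scale
# patterns (like which emojis are popular) remain recoverable.
print(noisy_count(100))
```

Any single noisy answer is a little wrong, but averaged over millions of users the noise washes out, which is why the technique suits questions like “what is everyone’s favourite emoji?”.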
Our guide ends with a checklist of factors to consider for any given de-identification proposal.
We don’t profess to be experts in statistical techniques; far from it. There are excellent, detailed, lengthy guides available from regulators around the world to help academic and clinical researchers, data scientists and statisticians de-identify data for their particular purposes.
But if you’re a privacy professional who just wants to understand how de-identification fits into privacy or data protection law, a simple illustration of how each different technique works, and a plain language overview of the strengths and weaknesses you need to factor into your risk assessment considerations, then this guide is for you.
Demystifying de-identification: An introductory guide for privacy officers, lawyers, risk managers and anyone else who feels a bit bewildered is now available as an eBook from the Salinger Privacy online store.
Photograph (c) Shutterstock