If your organisation is looking at hosting or participating in a data hackathon, whether to solve complex policy challenges, meet customer needs or identify business opportunities, you will first need to resolve a number of privacy challenges. In this blog we present an overview of the issues to think about.
But first, what do we mean by a data ‘hackathon’? The connotations of the term ‘hacker’ may give these events a bad rap – which is why some call them a ‘datathon’ instead – but the general idea is to bring together a group of people, including data scientists and subject matter experts, possibly in competing teams, to use available data to quickly generate insights and build potential solutions.
A data hackathon will typically be focussed on one type of problem, like adapting to climate change, urban transport planning, or improving clinical care. The ‘available data’ might come from participating or sponsoring organisations, and/or public datasets.
For each organisation which offers up its datasets, there are four main questions which need to be answered:
- Can we allow the data to be used for a hackathon?
- What other obligations do we have?
- Is de-identified data still ‘personal information’ within the meaning of the applicable privacy law?
- What data security measures will be necessary?
Can we use the data for a hackathon?
Even before you get to the question of de-identifying data, the decision to allow unit record level data to be used for a hackathon must be made in relation to the data as it is held by each organisation. For example, the act of de-identifying ‘personal information’ in order to supply it to a hackathon event is itself a ‘use’ of personal information.
In other words, consideration needs to start with the assumption that the data is covered by privacy laws, and that there will be restrictions on its secondary use, including use for the purpose of de-identifying it before providing it for a hackathon.
Each organisation will need to consider its own circumstances, and the particular privacy law/s which apply. Typically, you will need to look at something like a ‘directly related secondary purpose’ test under the applicable ‘Use’ principle.
Let’s say that you want to use information about retail customers’ purchase histories, returns, enquiries and complaints for the purpose of better understanding customer needs or improving service delivery; this might be considered a ‘directly related secondary purpose’ to the purpose for which the original data was collected.
However, the use of a retailer’s customer data for other purposes, like matching customer records across different organisations, may not meet this test. The challenges set by the hackathon organisers will therefore be critical to determining whether or not it is appropriate to grant permission for the use of your organisation’s customer data in the first place.
In other words, the objective of the hackathon itself must be related to solving genuine operational or policy problems faced by your organisation (even if those problems are common to or shared with multiple organisations, such as how to reduce carbon emissions, help the tourism sector recover from the pandemic, or improve children’s educational outcomes). It cannot be used solely for the benefit of other organisations.
Each organisation should therefore carefully review, and help to construct, the objectives of the hackathon itself, and the challenges which will be set for hackathon participants, to ensure that the problems to be ‘solved’ are directly related to operational or policy problems faced by the participating organisations.
What other obligations do we have?
Answering the question above, as to whether or not the information can lawfully be used, is not the end of your legal obligations under privacy laws. You also need to consider whether the information can be disclosed to third parties such as participants in the hackathon, and how you can comply with your data security obligations to protect the personal information from loss, misuse or unauthorised disclosure.
In particular, robust de-identification techniques, and other security measures, will be critical to ensuring the privacy of your customers is protected. These are explained further below.
Is de-identified data still ‘personal information’?
Although there are differences between privacy laws, the legal definition of ‘personal information’ typically depends on the identifiability of data, and regulatory guidance and case law suggest that identifiability is not to be considered in a vacuum. The test for whether a piece of data meets the definition of ‘personal information’ therefore often turns on whether any individual is identifiable from that data, either alone or when it is combined with other available data. A hackathon typically sets out to deliberately link up data from disparate sources, or over time, in order to build up a profile about a particular entity (such as an individual or a household), so the likelihood of being able to identify (or re-identify) individuals or households is increased.
The removal or masking of names and other identifiers will likely not be enough to robustly de-identify the data, such that the data is no longer ‘personal information’ for the purposes of compliance with privacy law.
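To see why removing names is not enough, here is a minimal sketch of a linkage attack. All records, field names and values are invented for illustration: a ‘de-identified’ dataset that still carries quasi-identifiers (postcode, date of birth) can be joined against a public source an attacker could plausibly hold.

```python
# Invented illustrative records: a "de-identified" dataset that still carries
# quasi-identifiers, and a public source an attacker could plausibly hold.
deidentified = [
    {"postcode": "2010", "dob": "1987-03-14", "diagnosis": "asthma"},
    {"postcode": "3052", "dob": "1990-11-02", "diagnosis": "diabetes"},
]
public_register = [
    {"name": "A. Citizen", "postcode": "2010", "dob": "1987-03-14"},
]

# Linkage attack: join the two sources on the shared quasi-identifiers.
matches = [
    (entry["name"], record["diagnosis"])
    for record in deidentified
    for entry in public_register
    if (record["postcode"], record["dob"]) == (entry["postcode"], entry["dob"])
]
print(matches)  # [('A. Citizen', 'asthma')]
```

No names or account numbers appear in the ‘de-identified’ dataset, yet the sensitive attribute is re-attached to a named person with a two-field join.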
Participating organisations should therefore work on the assumption that the data to be supplied will continue to meet the definition of ‘personal information’ even after de-identification techniques have been applied, and that therefore all legal obligations will still apply. As noted above, compliance with both Disclosure and Data Security obligations will be critical.
However this is not to suggest that de-identification has no value. Instead, de-identification should be seen as one essential tool in your data security toolkit.
What data security safeguards will be necessary?
Data security should be viewed as a whole. Both the state of the data (after it has been treated with de-identification techniques), and the environment in which it is to be released or held, are relevant to an overall assessment of the ‘safety’ of a data release.
There are also multiple levers or controls which could be applied, such that if one control (such as the security of the environment in which the data is to be held and used for the hackathon) is dialled up very high, other controls (such as de-identification) can potentially be dialled down.
This is important, because de-identification alone as a security control is likely to fail, and because de-identification typically involves a trade-off with data utility.
In other words, the more usable you want the data to be for the hackathon, the less you will be able to rely on de-identification alone to meet your data security obligations, and therefore the more you will need to rely on other security measures.
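One common way to reason about this trade-off is k-anonymity: the size of the smallest group of records sharing the same quasi-identifier values. The sketch below uses invented records and field names; it is not a complete privacy measure, but it shows how generalising fields raises k while destroying exactly the detail hackathon teams might want.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Smallest group size (k) when rows are grouped by the quasi-identifier fields."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

# Invented example records (not real data).
rows = [
    {"postcode": "2000", "age": 34, "spend": 120},
    {"postcode": "2010", "age": 36, "spend": 95},
    {"postcode": "2000", "age": 51, "spend": 40},
    {"postcode": "2010", "age": 53, "spend": 310},
]

# At full precision every record is unique (k = 1): each row can be singled out.
print(k_anonymity(rows, ["postcode", "age"]))  # 1

# Generalising (coarser postcode area, decade age bands) raises k to 2,
# but teams lose exact postcodes and ages: the utility trade-off in action.
for row in rows:
    row["area"] = row["postcode"][:2] + "xx"
    row["age_band"] = f"{row['age'] // 10 * 10}s"
print(k_anonymity(rows, ["area", "age_band"]))  # 2
```

The more you generalise to push k up, the less useful the data becomes, which is why other controls (such as a secure environment) usually have to carry part of the load.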
In terms of de-identification, you will need to consider:
- Whether the data is to represent a ‘snapshot’, or is longitudinal. Longitudinal data is much more difficult to de-identify.
- Whether the methods of masking or otherwise treating identifiers can be reverse engineered in a re-identification attack, as occurred with the MBS/PBS ‘open data’ release, which led to breaches of the Privacy Act.
- Whether there are data fields (especially unstructured or free-text fields) which might hold information that could reveal data in unexpected ways, as occurred with the Flight Centre hackathon, which led to breaches of the Privacy Act.
- Whether the data includes behavioural or pattern data which might itself be indirectly revealing of individuals in unexpected ways, as occurred with the Myki hackathon which led to breaches of the Victorian privacy laws. To the extent that the data might show patterns of behaviour or movements of individuals (e.g. ‘smart’ billing data from an energy retailer, or location data from an app or a mobile phone service provider), these risks should be carefully controlled for.
- What additional data is available (either publicly, or held by hackathon participants themselves) which could be matched with the customer data to enable re-identification. Examples might include ‘white pages’ directories, public registers such as the electoral roll and company registers, news reports, social media posts, and stolen or leaked customer data now available on the dark web.
Any hackathon proposal should be subject to an expert analysis of re-identification risk, to test if the data is safe for release within the hackathon environment.
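The reverse-engineering risk noted above can be made concrete with a minimal sketch. The identifier scheme here is invented (assume, for illustration, that provider IDs are 5-digit numbers): if the identifier space is small, simply hashing each identifier is not robust de-identification, because an attacker can hash every possible value and match.

```python
import hashlib

def mask(provider_id: str) -> str:
    # Naive masking: replace the identifier with its SHA-256 digest.
    return hashlib.sha256(provider_id.encode()).hexdigest()

# Suppose provider IDs are 5-digit numbers (an invented scheme for illustration).
masked = mask("04281")

# An attacker hashes the entire (small) identifier space and matches.
lookup = {mask(f"{n:05d}"): f"{n:05d}" for n in range(100_000)}
print(lookup[masked])  # recovers "04281"
```

The same brute-force logic applies to any deterministic transformation over a small or guessable identifier space, which is why expert analysis of the masking method itself, not just the output data, is needed.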
Other security measures
In addition to weak de-identification techniques, some of the failures evident in the Flight Centre and Myki hackathons included:
- Not restricting access to only hackathon participants (e.g. sharing data via the open web)
- Not preventing participants from exporting the data
- Not preventing participants from importing other data (which increases the re-identification risk)
- Not vetting participants in advance
- Not binding participants by contract
- Not enforcing such contracts
Hackathon organisers and participants should therefore consider:
- Ensuring there is clarity amongst all participating organisations and their data custodians about which party is responsible for protecting each dataset and identifying and managing privacy risks
- Limiting invitations to a known and fixed list of hackathon participants
- Ensuring the data is held entirely within a secure data enclave, to prevent extraction by participants
- Prohibiting participants from importing additional data
- Vetting and binding hackathon participants as if they were a contracted service provider to your organisation; provisions should include that they must only use the data for solving the challenges set by the hackathon organisers; that they must not attempt to re-identify the data, or on-disclose the data; and that they must destroy any holdings of the data (if they even have any) at the end of the hackathon
- Reminding participants of these rules at the start of the event
- Having dedicated communications and reporting pathways in case things go wrong, including a Data Breach Response Plan, and protocols for handling any privacy enquiries or complaints
So don’t let your data hackathon become a honeypot for hackers. Treat all data about individuals as ‘personal information’, and apply privacy considerations and controls accordingly.
Photograph © Clarisse Croset on Unsplash