It may seem counter-intuitive, but even synthetic data can have privacy impacts.
Companies and governments alike demand masses of data for operational, policy and research uses, but between large scale data breaches and the re-identification of published datasets, privacy risks abound.
Is synthetic data the solution?
There are a number of use cases for which synthetic data might be proposed as safer than ‘real’ data. These include:
- training new staff how to use a CRM system before letting them loose with a login
- developing new patient management software
- demonstrating how a data analytics program could work to map or analyse customer journeys
- testing whether protected taxation data will be useful for a research project before going to the effort of applying to access the real data, or
- generating fake student records at scale for an education-themed hackathon.
However as we discovered on a deep dive into this topic for a client recently, there can still be privacy impacts arising from synthetic data, such that both legal and ethical risks need to be carefully managed.
Entirely fake data could be generated from scratch, but to do so is time-consuming, difficult to do at scale, and the result would likely not ‘look’ real. Unless the same statistical properties have been carried across from real data, fake data will likely not be useful for anything. Hence the market for ‘synthetic’ data.
The Canadian Privacy Commissioner defines synthetic data as “fake data produced by an algorithm whose goal is to retain the same statistical properties as some real data, but with no one-to-one mapping between records in the synthetic data and the real data”. The UK’s ICO has a similar definition.
The idea is to produce fake data that ‘looks’ real, because it “retains the same structure and level of granularity as the original”. But without including any links back to the ‘real’ data, synthetic data should – in theory – resolve the privacy risks found when using the original data.
How is synthetic data made?
Developing synthetic records at scale, in a way which generates data realistic enough to be useful but without any one-to-one mapping which could lead to the ready identification of real people, requires two ingredients: a source dataset containing the ‘real’ records, and a generative model.
The source dataset
This is the original dataset containing records about a group of ‘real’ individuals, such as supermarket customers, hospital patients, taxpayers or university students. The synthetic data to be generated will be expected to emulate the statistical properties of this source data.
Some source datasets will include direct identifiers which clearly identify individuals, such as names and customer numbers. Others might include indirect or ‘quasi’ identifiers such as date of birth, gender and postcode, which in combination can render some individuals unique in the dataset (i.e. distinguishable from the rest of the group) and thus ‘identifiable’ for the purposes of privacy law.
Some source datasets might have already robustly controlled for both direct and indirect identifiers via de-identification techniques, but the ‘attribute’ data is itself rich enough that some individuals will be unique in the dataset, and thus ‘identifiable’ in law. For example, even without any direct or indirect identifiers about a patient, the parts of a patient record which show event dates (such as hospital admission, surgery and discharge dates) and clinical information (such as conditions or treatment) can be enough to render a patient unique in the dataset.
In all such cases, the source dataset contains ‘personal information’.
Let’s take the example of a source system for a hypothetical supermarket customer loyalty scheme. The dataset has 50% male and 50% female customers. If you look into purchasing history, you might find that 18% of customers regularly purchase tampons. However if you break down purchases by gender, you find that it is 35% of female customers and 1% of male customers who regularly buy tampons.
The generative model
This is the statistical model used to generate the synthetic data. It is derived from the source dataset. Statistical tables are created from the source dataset, which show things like the distribution of certain features amongst the total population within the source dataset. However the tables will not necessarily reflect all the underlying distributions of variables, or the correlations between them.
The statistical properties of the source data might include, for example, the distribution of supermarket customers across gender, age ranges, and geographic regions, as well as purchase histories, but without necessarily correlating one data field (such as gender) to another (such as purchases).
So one statistical table drawn from our hypothetical supermarket customer dataset will show that 50% of customers are male and 50% are female. Another will show that 18% of all customers regularly purchase tampons. No correlation between those two variables has been recorded.
These tables are then used to create the generative model.
The synthetic records
This is the data generated from the generative model: many thousands of individual records of ‘fake’ individuals, now sitting in a new database which can be queried by users.
Because the data about each synthetic individual is generated by working backwards from statistical tables, unless correlations were designed into the generative model, attributes will be distributed across the synthetic customer records according to the statistical properties of the source dataset.
So in the synthetic version of our hypothetical supermarket customer dataset, 18% of male customers’ records will show tampons in their purchase histories, as will 18% of female customers’ records.
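The effect can be sketched in a few lines of Python. This is a deliberately naive toy illustration, not any particular synthetic data tool: it samples each synthetic customer from the marginal statistics of the hypothetical supermarket example above, with the gender/purchase correlation discarded.

```python
import random

random.seed(0)  # reproducible toy example
N = 100_000

# Source-data statistics from the hypothetical loyalty scheme above:
# 50% of customers are female; 35% of women and 1% of men regularly
# buy tampons, giving an overall rate of 0.5*0.35 + 0.5*0.01 = 18%.
P_FEMALE = 0.5
P_TAMPONS = 0.5 * 0.35 + 0.5 * 0.01  # marginal purchase rate: 0.18

def synth_record():
    """Generate one synthetic customer from the marginal tables only."""
    gender = "F" if random.random() < P_FEMALE else "M"
    buys_tampons = random.random() < P_TAMPONS  # drawn independently of gender
    return gender, buys_tampons

records = [synth_record() for _ in range(N)]
for g in ("F", "M"):
    purchases = [b for gender, b in records if gender == g]
    print(f"{g}: {sum(purchases) / len(purchases):.1%} regularly buy tampons")
```

Both genders come out at roughly 18%, reproducing the artefact described above. A generative model that instead sampled from the joint distribution of gender and purchases would preserve the 35%/1% split, at the cost of carrying more of the source data’s structure (and hence more re-identification risk) into the synthetic records.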
It may seem counter-intuitive to assess synthetically created or ‘fake’ data for privacy risks, but there are three key reasons why such assessment is necessary.
Compliance at creation
Unless synthetic data is created completely from scratch, its creation will involve the use of a ‘real’ dataset as its starting point. If that source dataset contains personal information – i.e. potentially identifiable information about individual humans – any use of that dataset for a particular purpose will need to comply with the ‘Use’ principle in the applicable privacy law.
And if the source dataset was compiled from third party sources, such as by ‘scraping’ data from the internet, there will be legal implications in relation to compliance with ‘Collection’ principles too.
These compliance challenges with the source data may be enough to kill a project before it gets off the ground. (Though you wouldn’t suspect that was the case, since the race between a small number of Big Tech companies to build generative and other AI tools seems to involve conveniently ignoring these legal requirements, along with respect for intellectual property and moral rights over creators’ work.)
Re-identification risk
Again, unless synthetic data has been created completely from scratch, the manner in which it is created could lead to some re-identification risks being carried over from the source dataset. Academics have warned that synthetic data is not the “silver-bullet solution” promised to provide “perfect protection” against re-identification attacks: “If a synthetic dataset preserves the characteristics of the original data with high accuracy, and hence retains data utility for the use cases it is advertised for, it simultaneously enables adversaries to extract sensitive information about individuals”.
For example, let’s say there is an individual in the source dataset whose unique combination of variables or attribute data makes them ‘identifiable’ from the rest of the cohort. If the generative model is sophisticated enough to reflect correlations between all the variables, only one synthetic record will be created with that same combination of variables or attribute data.
Re-identification, leading to the unauthorised disclosure of personal information about real people from a synthetic dataset, thus remains a legal compliance and reputational risk to be managed. The UK Information Commissioner’s Office has noted that there is no standard available as to how synthetic data should be generated, and warns that “additional measures (eg Differential Privacy) may be required to protect against singling out”.
(‘Singling out’ is a phrase in UK/European data protection law to mean that an individual may be distinguished from the group, and thus ‘identifiable’ for the purposes of the definition of ‘personal data’, or ‘personal information’ as it is known in Australia.)
This is why a re-identification risk assessment should be completed for any synthetic datasets which were generated from a source dataset containing personal information.
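One simple starting point for such an assessment is to count the synthetic records which are unique on their quasi-identifiers and which replicate a record that is also unique in the source data. The sketch below is illustrative only, using made-up quasi-identifier tuples; a real assessment would use richer measures (k-anonymity metrics, attacker models, or the additional measures such as differential privacy that the ICO mentions).

```python
from collections import Counter

# Hypothetical quasi-identifier tuples: (year_of_birth, gender, postcode).
# In a real assessment these would come from the actual datasets.
source = [
    (1980, "F", "2000"), (1980, "F", "2000"), (1975, "M", "3000"),
    (1992, "F", "4000"), (1963, "M", "2000"),
]
synthetic = [
    (1980, "F", "2000"), (1992, "F", "4000"), (1975, "M", "3000"),
    (1981, "M", "5000"),
]

def unique_rows(rows):
    """Return the quasi-identifier combinations that occur exactly once."""
    counts = Counter(rows)
    return {row for row, n in counts.items() if n == 1}

# Synthetic records that are unique AND match a record that is also
# unique in the source data are candidates for 'singling out'.
risky = unique_rows(synthetic) & unique_rows(source)
print(f"{len(risky)} synthetic record(s) replicate a unique source record")
```

Here two of the four synthetic records replicate a combination that singles out one real person in the source data, while the record shared by two source customers does not, on its own, single anyone out.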
If a synthetic dataset poses some chance of re-identification, the use and disclosure of the synthetic data will need to be managed as ‘personal information’, in accordance with the relevant privacy principles. Either the dilution of data utility if you more robustly control for re-identification, or the privacy law compliance costs if you don’t, may undermine the business case for generating synthetic data in the first place.
Ethical challenges and concerns
Finally, regardless of the above legal compliance and re-identification risks, there are some ethical considerations to manage.
Data ethics in a synthetic world
From specialist privacy newsletters to mainstream media, the press has been saturated of late with analysis of the implications of generative AI and its ability to create and disseminate ‘fake’ personas, images, art, text and more.
Examples of generated content causing ethical concerns have ranged from ‘deep fake’ videos (such as a fake video of former US President Obama), photos (such as fake photos of former US President Trump) and audio (a fake interview with Steve Jobs generated well after his death), to deep fake porn used to target and harm individual women. In each of these cases, manipulation is occurring so as to create the impression that real people said or did particular things, when in truth they did not.
There is analysis pointing to the potentially biased and discriminatory outcomes of artificially generated personas, images, text or data, which tend to disproportionately harm women and other historically disadvantaged groups. Others have highlighted the intellectual property theft on which some generative AI tools have been built.
However there is less literature about the ethical implications of synthetic unit record level data, generated in the absence of audio or visual elements, and then used for research or other purposes.
We have identified five ethical concerns which may arise in relation to the use of synthetic data.
Loss of trust
Dr Rune Klingenberg from the Danish National Center for Ethics has written about the ethical implications of synthetic personas: “even more worrying than any particular piece of misinformation is the general loss of trust in the authenticity of images and the plausible deniability of facts that comes along with the advancement of technologies capable of producing convincing fake material”.
Incorrect recognition of real people
Even if the recognition is incorrect, if a person with access to a synthetic dataset believes that they recognise an individual in it, and as a result they ‘learn’ new facts about that person, such as a medical condition (whether that fact turns out to be correct or not), privacy harm can be done to a real person. This is because the legal definition of personal information applies whether or not the information about a person is true.
Entrenching disadvantage and bias
Some methods of controlling for re-identification risk include suppressing some data fields (e.g. ethnicity or country of birth), and/or removing small cell counts (e.g. people with unique combinations of variables). However this may further entrench disadvantage, if some groups are then excluded from analysis. Similarly, any inherent biases in the original data will be carried through to the synthetic data.
Therefore if the synthetic data is to be used to make policy or operational decisions that could have consequences for individuals, it will be important to detect and correct bias in the generation of synthetic data, and ensure that the synthetic data is representative.
Harm to vulnerable populations
Just as is the case with ‘real’ data, inferences drawn from synthetic data could lead to policy or operational decisions which impact negatively on vulnerable populations. Disregard for the risk of such outcomes would be contrary to the principles of ethical design and beneficence.
For example, while not creating a privacy risk for any individual patients, inferences drawn and publicised about poor health outcomes for patients prescribed a particular drug could lead to a loss of confidence in their treatment regime. Patients might face disadvantage if the inferences were used to cut government subsidies for the drug, and even poorer health outcomes might result if patients stopped taking their medicine.
Invalid inferences
If the synthetic dataset is not truly representative of the real population, or if the synthetic data only partially represents the source dataset and the correlations between all the underlying variables, there is also a risk that a person with access to the synthetic dataset could draw inferences or statistical conclusions which are invalid, leading to poor quality policy or operational decisions, and possibly alarming results.
To take our hypothetical supermarket customer loyalty dataset to its logical but absurd end, reliance on the synthetic dataset could lead the supermarket chain to waste its marketing budget offering specials on tampons to its male customers, or tampon manufacturers might decide they should re-design their product to better suit their male customers.
In addition, the revelation that inferences drawn from the synthetic dataset were incorrect could in turn undermine confidence in the integrity of the source dataset, and the program of data collection which underpins it.
The narrative that has to be presented to users of synthetic data and the public is one in which the dataset contains both valid and invalid data, and the inferences drawn may or may not be valid. This is a difficult message to convey.
So what can be done?
Like proposals to ‘watermark’ ChatGPT outputs to prevent students cheating, labelling synthetic data appropriately is one possible solution. However more than transparency alone, projects should involve careful design of the generative model, attuned to the likely users and uses of the synthetic records.
Or in the words of Tracey Kerrigan in the Jenny Jenny / fake flowers scene from The Castle:
The trick is to make ‘em real, but not too real … just real enough to know that they’re fake.
Photograph © Kelly Common on Unsplash
Want to know more about how to use or build trustworthy AI? Follow our pragmatic guidance and training in the Algorithms Bundle.