Is synthetic data a universal solution for data privacy?

Ivana Bartoletti, Global Chief Privacy & AI Governance Officer at Wipro, weighs up whether synthetic data can truly curb data extractivism and its carbon cost, or if it merely reshapes today’s privacy and oversight dilemmas in a new guise.

Data is often referred to as the new oil. Although the comparison is incorrect (data can be reused, oil cannot), it does convey that data is a valuable resource that fuels innovation, influences decisions, and drives the global economy. However, the current model of data collection and use, often termed ‘data extractivism’, raises significant concerns regarding data privacy, environmental impact, and ethics. As we grapple with these challenges, synthetic data is often seen as a potential solution – albeit one that presents both opportunities and risks.

The scale of data collection

Recent estimates suggest that around 2.5 quintillion bytes (2.5 exabytes) of data are produced worldwide every day. This includes everything from social media posts and online purchases to IoT sensor data and satellite imagery.

A 2019 forecast estimated that by 2025, 463 exabytes of data will be generated globally each day. However, this massive data generation and collection comes at a cost, both in terms of individual privacy and the environment.

Impact on data privacy

The extensive collection of personal data poses significant risks from a privacy standpoint. Data breaches are becoming increasingly common, exposing sensitive information from millions of users. Furthermore, the aggregation of data from various sources enables detailed profiling of individuals, potentially leading to discriminatory practices in areas such as employment, insurance, and credit scoring.

A lack of transparency in data collection and use further exacerbates these concerns. Many users are unaware of the extent to which their data is collected, shared, and monetised. This undermines trust in digital services and raises questions about the power dynamic between technology companies and the individual user.

Environmental impact

The data economy also has a significant ecological footprint. Data centres, which store and process the vast amounts of collected data, consume enormous amounts of energy. Estimates suggest that data centres account for about 1% of global electricity consumption. As data generation continues to grow exponentially, so does the energy demand for its storage and processing, resulting in CO2 emissions and contributing to climate change.

Synthetic data: A promising alternative?

Synthetic data refers to artificially generated information that mimics the statistical properties of real data without containing actual personal information. This approach offers several advantages in addressing the problems associated with the current data collection model:

Enhanced data privacy: By using synthetic data, organisations can develop and test applications, train machine learning models, and conduct research without fear of exposing real personal information. Synthetic data can also be valuable in developing digital twins. Working without real personal data significantly reduces the privacy risks associated with data breaches and unauthorised access. In this sense, synthetic data can be counted among the broad category of Privacy Enhancing Technologies (PETs).
Reduced need for data collection: Synthetic data can supplement or even replace real data in many applications, potentially reducing the need for extensive collection of information about individuals. This could help mitigate privacy concerns associated with the current data extractivism model.
Environmental benefits: By reducing the need to store and process large amounts of real user information, synthetic data can help reduce the energy consumption and carbon footprint of data centres. This development is in line with efforts to make the digital economy more sustainable.
Improved data availability: Synthetic data can be generated to represent rare scenarios or underrepresented groups, addressing issues like bias and underrepresentation in existing datasets. This can lead to more inclusive and fair AI systems and data-driven decision-making processes, which can have important implications for the healthcare sector, for example, where historically grown databases are far from bias-free.
Compliance with data protection regulations: The use of synthetic data can help companies comply with data protection regulations such as the GDPR by ensuring that no real personal data is processed.
Cost efficiency: Generating synthetic data can be more cost-effective than collecting and managing large amounts of real data, especially for scenarios that are rare or difficult to capture in real life.
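
To make the idea of "mimicking the statistical properties of real data" concrete, here is a minimal sketch in Python. It assumes a tiny, invented two-column dataset (age and annual income) and a deliberately simple generator: it fits means, standard deviations, and the correlation of the real records, then samples new records from a matching bivariate normal distribution. The dataset and the function names fit, sample and pearson are hypothetical, and production-grade generators are considerably more sophisticated.

```python
import math
import random
import statistics

# Hypothetical "real" records: (age, annual income). Invented for illustration.
REAL = [(25, 30000), (32, 42000), (41, 58000), (29, 36000),
        (55, 72000), (47, 61000), (38, 50000), (60, 80000)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit(records):
    """Summarise the real data by means, standard deviations and correlation."""
    xs = [r[0] for r in records]
    ys = [r[1] for r in records]
    return {
        "mean_x": statistics.fmean(xs), "sd_x": statistics.stdev(xs),
        "mean_y": statistics.fmean(ys), "sd_y": statistics.stdev(ys),
        "rho": pearson(xs, ys),
    }


def sample(params, n, rng):
    """Draw n synthetic records from a bivariate normal with the fitted statistics."""
    rho = params["rho"]
    ortho = math.sqrt(max(0.0, 1.0 - rho * rho))  # guard against rounding error
    records = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = params["mean_x"] + params["sd_x"] * z1
        y = params["mean_y"] + params["sd_y"] * (rho * z1 + ortho * z2)
        records.append((x, y))
    return records


params = fit(REAL)
synthetic = sample(params, 5000, random.Random(0))
```

The synthetic records share the averages and the age-income relationship of the originals, but none of them corresponds to a real person. The sketch also hints at the limits: anything the fitted model does not capture (skewed distributions, outliers, rare subgroups) is lost in the synthetic output.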

Pitfalls and challenges with synthetic data

While synthetic data offers many advantages, the associated challenges and potential drawbacks should not be overlooked:

Re-identification risks: Even though no real personal information is involved, there is still a risk that individuals could be re-identified if the synthetic data closely mimics patterns from the original dataset. Ensuring confidentiality while maintaining utility is a major challenge requiring careful consideration and advanced methods.
Quality: It is crucial that synthetic data accurately reflects the complexity and nuances of real data. If the synthetic datasets do not capture key patterns or relationships in the real data, this can lead to inaccurate models or flawed insights in analytical or machine learning applications.
Efficiency trade-offs: While synthetic data can reduce the need for real data collection, generating high-quality synthetic data often requires significant computing power. This, in turn, reduces the environmental benefits, especially if large amounts of synthetic data need to be generated frequently.
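
The re-identification risk described above can be illustrated with a simple heuristic, sketched here in Python: for each synthetic record, measure the distance to its closest real record and flag those that are near copies. The data, the threshold of 0.05, and the function names are assumptions made for illustration. A check like this can reveal obvious leaks, but passing it is not a privacy guarantee.

```python
import math

# Hypothetical records: (age, annual income). Invented for illustration.
real = [(25, 30000), (41, 58000), (60, 80000)]
synthetic = [(25, 30000), (33, 45000), (59, 79500)]


def scale(record, lows, spans):
    """Rescale each column to roughly [0, 1] using the ranges of the real data."""
    return tuple((v - lo) / span for v, lo, span in zip(record, lows, spans))


def distance_to_closest_record(synthetic_rows, real_rows):
    """For each synthetic row, return the Euclidean distance to the nearest real row."""
    columns = list(zip(*real_rows))
    lows = [min(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    scaled_real = [scale(r, lows, spans) for r in real_rows]
    return [
        min(math.dist(scale(s, lows, spans), r) for r in scaled_real)
        for s in synthetic_rows
    ]


THRESHOLD = 0.05  # assumed cut-off; a real project would need to justify this value
distances = distance_to_closest_record(synthetic, real)
flagged = [i for i, d in enumerate(distances) if d < THRESHOLD]
print(flagged)
```

Here the first synthetic record is an exact copy of a real one and the third is nearly identical, so both are flagged, while the second sits well away from every real record. Tightening the generator to avoid such near copies usually costs some fidelity, which is the confidentiality-versus-utility tension noted above.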

A question of balance

In a challenging data-driven economy, synthetic data represents a promising tool that leverages the power of data analytics and machine learning while mitigating some of the most pressing concerns associated with data extractivism. It is crucial to view synthetic data from a balanced perspective: while it can significantly improve data protection and reduce the environmental burden of data storage, it is not a panacea for all data-related problems. Challenges such as potential re-identification, the complexity of generating truly representative data, and the required computing capacity should also be carefully considered.

For the effective use of synthetic data, the development of robust methods for generation, validation, and utilisation will be central in the future. This includes further development of methods for creating more realistic and diverse synthetic datasets, implementing strong security precautions to prevent re-identification, and establishing clear ethical guidelines and regulatory frameworks for their application.