Saturday, December 6, 2025

"Synthetic Dreams, Real Frictions: Reimagining Computer-Generated Data"

A companion piece to the post immediately below, "Introducing Unified Model Collapse".

From Bot Populi, November 3:

“…asked whether he was worried about regulatory probes into ChatGPT’s potential privacy violations. Altman brushed it off, saying he was ‘pretty confident that soon all data will be synthetic data.’” – Financial Times (2023)

Sam Altman’s remark that soon “all data will be synthetic” is not just a provocation—it captures how debates on synthetic data often unfold through bold claims that sidestep the pressing issues. What, in fact, are synthetic data? Why do they matter? And how can we critically understand their world-making potential, especially when viewed from the vantage point of the Global South?

This post argues that synthetic data are not neutral technical tools but socially constructed representations whose world-making potential is deeply contested. They are both fictions and frictions: fictions because they are constructed representations of reality, embedded with the assumptions of generative models; frictions because they produce practical and epistemological tensions when deployed within infrastructures such as finance. These tensions are particularly acute in Global South contexts, where synthetic data narratives often obscure existing infrastructural and governance challenges. We draw on our academic study of responses to synthetic data in Latin America to illustrate distinctions between dreams and realities as well as to push for reimagining what synthetic data are and could be.

Fictions of Synthetic Data in Finance

The European Data Protection Supervisor defines synthetic data as “artificial data that [are] generated from original data and a model that is trained to reproduce the characteristics and structure of the original data.” While this framing is technically accurate and useful for legal purposes, it misses the extent to which synthetic data are narrative-laden fictions.

Synthetic data are promoted as risk-free, privacy-preserving, and innovation-friendly. In finance, synthetic data are said to enable experimentation without endangering consumer privacy, as well as to simulate risk scenarios without exposing real data.

Yet the term “synthetic” is not an innocent descriptor. In finance especially, “synthetic” evokes the ghosts of the risky instruments at the heart of global market collapse. The 2007–08 financial crisis, precipitated in part by instruments such as synthetic collateralized debt obligations, left a lasting stigma on synthetic products. Financial professionals tend to approach anything synthetic with caution. It is no coincidence that the Financial Times interview with Sam Altman cited above opted for the term “computer-generated data” rather than “synthetic”. Even though the meaning and uses of what counts as synthetic have evolved since the financial crisis, the term persistently evokes worries in this sector, whether in relation to the synthetic risk transfer market or to so-called synthetic stablecoins.

In financial contexts, narratives around synthetic data are therefore marked by a persistent tension: they are presented as innovative and promising yet remain overshadowed by enduring associations with risk and opacity. This specific context prompts us to posit that synthetic data must be understood not only as technical artifacts but also as infrastructural narratives....

....MUCH MORE