From Bot Populi, May 26:
Artificial Data for Artificial Intelligence: Could This Be a Game Changer?
Will synthetic data level the playing field or exacerbate existing inequalities in the tech sector?
A significant amount of data, while remaining uncollected, theoretically exists, providing an opening for synthetic data algorithms to step in and transform how we use data. Advancements in this field rely on algorithms that tune and are tuned by what we expect would happen. This might sound like guesswork, but today we can push reality to its limit by running thousands, if not millions, of simulations.
Synthetic data was hitherto the stuff of sci-fi, but recent breakthroughs have driven legislators, researchers, and developers to seriously consider the implications of the technology, especially since Big Tech corporations as well as state arms like the US Army have begun deploying it.
Synthetic data has the potential to fill the maw of eternally hungry data-processing algorithms, no matter how large or complex. Not only can it bypass privacy concerns (since simulated people don’t have addresses or personal attributes unless we want them to), it also allows smaller companies like start-ups, which have neither competitive data collection capabilities nor the capacity to pay data brokers for the same, to compete in a market where artificial intelligence (AI) training data is worth more than a billion dollars. Two years ago, MIT released the ‘Synthetic Data Vault’, designed to help those without access to data learn and compete in the AI market.
Indeed, we may be on the cusp of a level playing field vis-à-vis the tech sector. But for that to materialize, it’s key that we load the synthetic data dice in favor of such an outcome.
The Tech and Its Use Cases
Imagine a six-sided die. You could roll it as many times as you want and write down the numbers. But, then…why bother rolling the die? You could also just write down a number from 1 to 6 as many times as you needed to. Instead of collecting data, you’re ‘sampling’ from your brain, creating a synthesized dataset of dice rolls. Synthesizing this data is easier than rolling and recording. Also, you likely know that the average of all rolls should be around 3.5, which would allow you to make a more accurate (although still synthesized) dataset.
Now imagine this process being done ‘algorithmically’. Simply put, synthetic data is data created by a computer. Like random sampling from your brain, a competent synthetic dataset draws from huge swathes of real-life samples. Most people, even AI engineers, would be pretty hard-pressed to distinguish it from ‘real’ data.
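To make the dice thought experiment concrete, here is a minimal Python sketch. It is an illustration only, not the method of any tool named in the article: the function names `synthesize_rolls` and `resample`, the seed values, and the tiny `observed` list are all hypothetical. The first function writes down numbers from 1 to 6 without rolling anything; the second draws new records from a handful of previously observed ones, a very rough stand-in for sampling from real-life data.

```python
import random
from statistics import mean


def synthesize_rolls(n, seed=None):
    """Return n synthetic rolls of a fair six-sided die (no physical die involved)."""
    rng = random.Random(seed)  # seeded generator so the run is repeatable
    return [rng.randint(1, 6) for _ in range(n)]  # randint includes both endpoints


def resample(observed, n, seed=None):
    """Return n synthetic records drawn, with replacement, from observed ones."""
    rng = random.Random(seed)
    return rng.choices(observed, k=n)


# Hypothetical example values, chosen only for illustration.
rolls = synthesize_rolls(10_000, seed=0)
print("average of synthetic rolls:", mean(rolls))  # expected to land near 3.5

observed = [2, 5, 5, 6, 1, 3]  # a small, made-up set of 'real' rolls
print("resampled rolls:", resample(observed, 5, seed=1))
```

Note the difference between the two approaches: the first encodes what we believe about the die (every face is equally likely), while the second can only reproduce whatever the observed sample happened to contain, including its gaps and biases.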
While the sampling and generation method uses and scrambles pre-existing lists of common, available data, a more advanced and concerning method is the use of a Generative Adversarial Network (GAN). A GAN comprises two AIs trained against each other: a generator and a discriminator. The generator produces data and the discriminator judges whether that data could plausibly belong to the real dataset. By pushing each other to improve, both algorithms acquire greater intelligence and success with prediction....
....MUCH MORE
Previously from Bot Populi: