AI and Training Data:
Over at The Atlantic, Ross Andersen asks “What Happens When AI Has Read Everything?” AI has recently demonstrated impressive capabilities in language translation and writing. He points out that AIs can translate among more than a hundred languages in real time and can produce pastiches in a range of literary styles. They can even make a stab at rhyming poetry.
But researchers warn that this progress could stall if there is no new text to train on, and good prose is hard to find. Arbitrary text captured at random from around the web does not, in general, make good training data.
Training an AI is the equivalent of locking someone in the Library of Congress and telling them not to come out until they’ve finished a speed course in human culture. Generative AI systems depend on access to huge volumes of training data, whether text for something like GPT-3 or images for something like DALL-E. Google Books researchers estimate that since Gutenberg, humans have published more than 125 million titles. An estimated ten to thirty million of them are already digitized, giving AIs access to hundreds of billions of words.
Andersen notes more fanciful suggestions for gathering data, including harvesting text messages and keystrokes, and having everyone wear microphones so that their spontaneous speech can be converted to training data. Setting aside that this sounds like a dystopian security nightmare, it’s not clear that the additional data would be of much value. So far, larger, more powerful models have needed more data to train on, but, as with Moore’s Law, it remains to be seen whether that scaling relationship will continue. If we hit the point where more training data provides no further gains, the problem may solve itself.