Until 2019, if you came across several paragraphs of text on a consistent topic with consistent subjects, you could assume that text was written or structured by a human being. That is no longer true.
Over the past year, AI researchers designed computer programs with the ability to generate multi-paragraph stories that remain fairly coherent throughout. As we explain in the video above, these programs create passages of text that seem like they were written by someone who is fluent in the language but possibly faking their knowledge. I’m not sure we needed an automated version of that person, but here it is.
Depending on how you look at it, this technology is a powerful bullshit machine or a promising tool for artists. So far, the creative uses seem to outnumber the malicious ones, but it’s not difficult to imagine how text-fakes could cause harm, especially since these models have been widely shared and are now deployable by anyone with basic know-how.
The field of Natural Language Processing (NLP) didn’t exactly set out to create a fake news machine. Rather, this is the byproduct of a line of research into massive language models — machine learning programs that build vast statistical maps of the correlations between words. They look at a sample of text and guess the next word based on how frequently that word appeared in similar contexts in the training data.
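To make the "guess the next word" idea concrete, here is a minimal sketch of the simplest possible version: a program that counts which word follows which in a sample of text, then predicts the most frequent follower. This is a toy, not how the large models described here are built. The corpus and the function names (`train_bigram_model`, `predict_next`, `generate`) are made up for illustration, and the program only looks one word back.

```python
from collections import Counter, defaultdict

# A tiny, made-up training corpus (an assumption for illustration only).
CORPUS = (
    "the cat sat on the mat . "
    "the cat ate the fish . "
    "the dog sat on the rug ."
)

def train_bigram_model(text):
    """Count how often each word follows each other word in the text."""
    words = text.lower().split()
    follow_counts = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        follow_counts[current_word][next_word] += 1
    return follow_counts

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if never seen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

def generate(model, start, length=5):
    """Extend `start` one word at a time, always taking the top guess."""
    output = [start]
    for _ in range(length):
        next_word = predict_next(model, output[-1])
        if next_word is None:
            break
        output.append(next_word)
    return " ".join(output)

model = train_bigram_model(CORPUS)
print(predict_next(model, "the"))  # the word that most often followed "the"
print(generate(model, "the"))
```

Because this sketch only ever sees the single previous word, it has no way to remember what a sentence was about a few words earlier, which is exactly the limitation of older models.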
That sounds simple but it’s an incredibly challenging task. They need to account for the fact that different words can have different meanings depending on the context. They need to be able to sort out which pronouns refer to which nouns. And they need to keep track of long-range dependencies, which are words whose meanings hinge on other words that are relatively far away. Since most computer models in the past were focused on the immediate context, they couldn’t continue a consistent idea or story.
That has changed for two reasons. First, these models are now “pretrained” on far more data than before: millions of articles pulled from the internet. And second, the computers are able to handle that amount of data because researchers adopted a new type of model called the “Transformer,” which allows for a more efficient use of computing power. The result is that the models can access more contextual information about each word and therefore make more plausible sentence predictions.
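One way to picture how a model can "access more contextual information about each word" is the attention operation at the core of Transformer models: every word scores its similarity to every other word in the passage, near or far, and then blends in information from the ones that score highest. The sketch below is a heavily simplified illustration under stated assumptions. The two-number word vectors are invented, and real Transformers add learned projections, multiple attention heads, and many stacked layers that are left out here.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified scaled dot-product self-attention.

    Each vector is compared with every vector in the sequence, and its
    output is a weighted average of all of them. Returns the attention
    weights and the blended vectors.
    """
    dim = len(vectors[0])
    all_weights = []
    outputs = []
    for query in vectors:
        # Similarity of this word to every word, regardless of distance.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        blended = [
            sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(dim)
        ]
        all_weights.append(weights)
        outputs.append(blended)
    return all_weights, outputs

# Hypothetical toy vectors: the first and last words are similar,
# and the middle word is unrelated to both.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
weights, outputs = self_attention(toy_vectors)
for row in weights:
    print([round(w, 2) for w in row])
```

In this toy case the first word puts more weight on the similar third word than on the unrelated second one, even though the second is closer. That is the sense in which distance stops being the deciding factor.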
Over the past few years, pretrained language models have enabled huge strides across a number of language tasks. Text generation is a key part of language translation, chatbots, question-answering, and summarization. The problem is that in their simplest form, when they’re prompted to do open-ended generation, language models are indifferent to the truth. That’s what makes them creative, but it also puts them on the wrong side of the battle against trolls, propagandists, and con artists online.
Bots roam the internet in huge numbers, primarily deceiving other computers. Now, with a decent handle on our language, they have new ways of deceiving humans directly. Certainly it’s been possible to simply hire people to write posts, fake reviews, and misinformation. What this tool adds is scale, language fluency, and the ability to mirror the jargon and writing style of any profession or, with enough samples, any individual.
The recent advances in language modeling mean that voice assistants will get better, chatbots will get better, and businesses will have better ways of analyzing documents. But for the places where humans gather online to talk to other humans, the internet will get a little worse.
You can find this video and all of Vox’s videos on YouTube. And join the Open Sourced Reporting Network to help us report on the real consequences of data, privacy, algorithms, and AI.