To better understand how neural networks learn to simulate writing, researchers trained simpler versions on synthetic children’s stories.
www.quantamagazine.org
Tiny Language Models Come of Age
To better understand how neural networks learn to simulate writing, researchers trained simpler versions on synthetic children’s stories.
Tiny Language Models Thrive With GPT-4 as a Teacher | Quanta Magazine
READ LATER
Adam Nickel for
Quanta Magazine
ByBen Brubaker
Staff Writer
October 5, 2023
Introduction
Learning English is no easy task, as countless students well know. But when the student is a computer, one approach works surprisingly well: Simply feed mountains of text from the internet to a giant mathematical model called a neural network. That’s the operating principle behind generative language models like OpenAI’s ChatGPT, whose ability to converse coherently (if not always truthfully) on a wide range of topics has surprised researchers and the public over the past year.
But the approach has its drawbacks. For one thing, the “training” procedure required to transmute vast text archives into state-of-the-art language models is costly and time-intensive. For another, even the people who train large language models find it hard to understand their inner workings; that, in turn, makes it hard to predict the many ways they can fail.
Faced with these difficulties, some researchers have opted to train
smaller models on smaller data sets and then study their behavior. “It’s like sequencing the
Drosophila genome versus sequencing the human genome,” said
Ellie Pavlick, a language model researcher at Brown University.
Now, in a
paper recently posted to the scientific preprint server arxiv.org, a pair of Microsoft researchers have introduced a new method for training tiny language models: Raise them on a strict diet of children’s stories.
Machine learning researchers have embraced this lesson. GPT-3.5, the large language model that powers the ChatGPT interface, has nearly 200 billion parameters, and it was trained on a data set comprising hundreds of billions of words. (OpenAI hasn’t released the corresponding figures for its successor, GPT-4.) Training such large models typically requires at least 1,000 specialized processors called GPUs running in parallel for weeks at a time. Only a few companies can muster the requisite resources, let alone train and compare different models.
The two researchers showed that language models thousands of times smaller than today’s state-of-the-art systems rapidly learned to tell consistent and grammatical stories when trained in this way. Their results hint at new research directions that might be helpful for training larger models and understanding their behavior.
“I found this paper very informative,” said
Chandra Bhagavatula, a language model researcher at the Allen Institute for Artificial Intelligence in Seattle. “The concept itself is super interesting.”
Once Upon a Time
The neural networks at the heart of language models are mathematical structures loosely inspired by the human brain. Each one contains many artificial neurons arranged in layers, with connections between neurons in adjacent layers. The neural network’s behavior is governed by the strength of these connections, called parameters. In a language model, the parameters control which words the model might spit out next, given an initial prompt and the words it has generated already.
A model only truly comes to life during training, when it repeatedly compares its own output to the text in its training data set and adjusts its parameters to increase the resemblance. An untrained network with random parameters is trivially easy to assemble from a few lines of code, but it will just produce gibberish. After training, it can often plausibly continue unfamiliar text. Larger models often undergo further fine-tuning that teaches them to answer questions and follow instructions, but the bulk of the training is mastering word prediction.
Success at word prediction requires a language model to master many different skills. For example, the rules of English grammar suggest that the next word after the word “going” is likely to be “to,” regardless of the subject of the text. In addition, a system needs factual knowledge to complete “the capital of France is,” and completing a passage containing
the word “not” requires a rudimentary grasp of logic.
“Raw language is very complicated,” said
Timothy Nguyen, a machine learning researcher at DeepMind. “In order for interesting linguistic capabilities to arise, people have resorted to ‘more data is better.’”
Ronen Eldan realized he could use children’s stories generated by large language models to rapidly train smaller ones.
Weizmann Institute of Science
Introduction
Ronen Eldan, a mathematician who joined Microsoft Research in 2022 to study generative language models, wanted to develop a cheaper and faster way to explore their abilities. The natural way to do that was by using a small data set, and that in turn meant he’d have to train models to specialize in a specific task, so they wouldn’t spread themselves too thin. Initially, he wanted to train models to solve a certain class of math problems, but one afternoon, after spending time with his 5-year-old daughter, he realized that children’s stories were a perfect fit.
“It literally came to me after I read her a story,” he said.
To generate coherent children’s stories, a language model would need to learn facts about the world, keep track of characters and events, and observe the rules of grammar — simpler versions of the challenges facing large models. But large models trained on massive data sets learn countless irrelevant details along with the rules that really matter. Eldan hoped the brevity and limited vocabulary of children’s stories might make learning more manageable for small models — making them both easier to train and easier to understand.
In language model research — as in every classroom — grading is a fraught topic.
In the world of language models, though, “small” is relative: A data set a thousand times smaller than the one used to train GPT-3.5 would still need to contain millions of stories. “I don’t know how much money you want to spend, but I’m guessing you’re not going to hire professionals to write [a couple million] short stories,” Nguyen said.
It would take an extraordinarily prolific author to satisfy such voracious readers, but Eldan had a few candidates in mind. Who better to write for an audience of small language models than large ones?
Toy Stories
Eldan immediately set out to create a library of synthetic children’s stories generated by large language models. But he soon discovered that even state-of-the-art models aren’t naturally very creative. If you just tell GPT-4 to write stories appropriate for 4-year-olds, Eldan said, “about one-fifth of the stories will be about children going to the park being scared of the slides.” That’s apparently the quintessential preschool story, as far as the internet is concerned.
The solution was to add a bit of randomness into the prompt. First, Eldan used GPT-4 to generate a list of 1,500 nouns, verbs and adjectives that a 4-year-old might know — short enough that he could easily check it himself. Then he wrote a simple computer program that would repeatedly prompt GPT-3.5 or GPT-4 to generate an age-appropriate story that included three random words from the list, along with an additional randomly chosen detail like a happy ending or plot twist. The resulting stories, mercifully, were less focused on scary slides.
Eldan now had a procedure for churning out training data on demand, but he had no idea how many stories he’d need to train a functional model, or how big that model would need to be. That’s when he teamed up with
Yuanzhi Li, a machine learning researcher at Microsoft and Carnegie Mellon University, to try different possibilities, taking advantage of the fact that small models could be trained very quickly. Step 1 was deciding how to evaluate their models.
Yuanzhi Li teamed up with Eldan to compare different models trained on synthetic children’s stories. They found that surprisingly small models could learn to tell coherent stories.