The tech industry can’t agree on what open-source AI means. That’s a problem.
The answer could determine who gets to shape the future of the technology.
www.technologyreview.com
By Edd Gent
March 25, 2024
Suddenly, “open source” is the latest buzzword in AI circles. Meta has pledged to create open-source artificial general intelligence. And Elon Musk is suing OpenAI over its lack of open-source AI models.
Meanwhile, a growing number of tech leaders and companies are setting themselves up as open-source champions.
But there’s a fundamental problem—no one can agree on what “open-source AI” means.
On the face of it, open-source AI promises a future where anyone can take part in the technology’s development. That could accelerate innovation, boost transparency, and give users greater control over systems that could soon reshape many aspects of our lives. But what even is it? What makes an AI model open source, and what disqualifies it?
The answers could have significant ramifications for the future of the technology. Until the tech industry has settled on a definition, powerful companies can easily bend the concept to suit their own needs, and it could become a tool to entrench the dominance of today’s leading players.
Entering this fray is the Open Source Initiative (OSI), the self-appointed arbiters of what it means to be open source. Founded in 1998, the nonprofit is the custodian of the Open Source Definition, a widely accepted set of rules that determine whether a piece of software can be considered open source.
Now, the organization has assembled a 70-strong group of researchers, lawyers, policymakers, activists, and representatives from big tech companies like Meta, Google, and Amazon to come up with a working definition of open-source AI.
The open-source community is a big tent, though, encompassing everything from hacktivists to Fortune 500 companies. While there’s broad agreement on the overarching principles, says Stefano Maffulli, OSI’s executive director, it’s becoming increasingly obvious that the devil is in the details. With so many competing interests to consider, finding a solution that satisfies everyone while ensuring that the biggest companies play along is no easy task.
Fuzzy criteria
The lack of a settled definition has done little to prevent tech companies from adopting the term. Last July, Meta made its Llama 2 model, which it referred to as open source, freely available, and it has a track record of publicly releasing AI technologies. “We support the OSI’s effort to define open-source AI and look forward to continuing to participate in their process for the benefit of the open source community across the world,” Jonathan Torres, Meta’s associate general counsel for AI, open source, and licensing, told us.
That stands in marked contrast to rival OpenAI, which has shared progressively fewer details about its leading models over the years, citing safety concerns. “We only open-source powerful AI models once we have carefully weighed the benefits and risks, including misuse and acceleration,” a spokesperson said.
Other leading AI companies, like Stability AI and Aleph Alpha, have also released models described as open source, and Hugging Face hosts a large library of freely available AI models.
While Google has taken a more locked-down approach with its most powerful models, like Gemini and PaLM 2, the Gemma models released last month are freely accessible and designed to go toe-to-toe with Llama 2, though the company described them as “open” rather than “open source.”
But there’s considerable disagreement about whether any of these models can really be described as open source. For a start, both Llama 2 and Gemma come with licenses that restrict what users can do with the models. That’s anathema to open-source principles: one of the key clauses of the Open Source Definition outlaws the imposition of any restrictions based on use cases.
The criteria are fuzzy even for models that don’t come with these kinds of conditions. The concept of open source was devised to ensure developers could use, study, modify, and share software without restrictions. But AI works in fundamentally different ways, and key concepts don’t translate from software to AI neatly, says Maffulli.
One of the biggest hurdles is the sheer number of ingredients that go into today’s AI models. All you need to tinker with a piece of software is the underlying source code, says Maffulli. But depending on your goal, dabbling with an AI model could require access to the trained model, its training data, the code used to preprocess this data, the code governing the training process, the underlying architecture of the model, or a host of other, more subtle details.
Which ingredients you need to meaningfully study and modify models remains open to interpretation. “We have identified what basic freedoms or basic rights we want to be able to exercise,” says Maffulli. “The mechanics of how to exercise those rights are not clear.”
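To make the ingredient list concrete, here is a minimal sketch in Python that treats a model release as a checklist. The `ModelRelease` class, its field names, and the `missing_ingredients` helper are hypothetical illustrations, not part of any OSI definition or real library, and the list covers only the ingredients named above.

```python
from dataclasses import dataclass, fields


@dataclass
class ModelRelease:
    """Hypothetical checklist: which ingredients a release makes public."""
    trained_model: bool = False
    training_data: bool = False
    preprocessing_code: bool = False
    training_code: bool = False
    architecture: bool = False


def missing_ingredients(release: ModelRelease) -> list[str]:
    """Return the names of ingredients the release does not publish."""
    return [f.name for f in fields(release) if not getattr(release, f.name)]


# Assumed example: a release that publishes only the trained model.
weights_only = ModelRelease(trained_model=True)
gaps = missing_ingredients(weights_only)
```

Whether any particular gap should disqualify a release is exactly the question the checklist cannot answer on its own.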
Settling this debate will be essential if the AI community wants to reap the same benefits software developers gained from open source, which was built on broad consensus about what the term meant, says Maffulli. “Having [a definition] that is respected and adopted by a large chunk of the industry provides clarity,” he says. “And with clarity comes lower costs for compliance, less friction, shared understanding.”
By far the biggest sticking point is data. All the major AI companies have simply released pretrained models, without the data sets on which they were trained. For people pushing for a stricter definition of open-source AI, Maffulli says, this seriously constrains efforts to modify and study models, automatically disqualifying them as open source.
Others have argued that a simple description of the data is often enough to probe a model, says Maffulli, and you don’t necessarily need to retrain from scratch to make modifications. Pretrained models are routinely adapted through a process known as fine-tuning, in which they are partially retrained on a smaller, often application-specific, dataset.
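As a rough illustration of fine-tuning, the toy sketch below keeps a "pretrained" layer fixed and adjusts only a small output layer on a handful of new examples, without touching the original training data. Everything here is an assumption made for illustration: the two-weight model, the numbers, and the function names are invented, and real models are adapted with far larger networks and dedicated libraries.

```python
def features(x, frozen_w):
    # "Pretrained" layer: these weights stay fixed during fine-tuning.
    return [max(0.0, w * x) for w in frozen_w]


def predict(x, frozen_w, head):
    return sum(h * f for h, f in zip(head, features(x, frozen_w)))


def mean_squared_error(frozen_w, head, data):
    return sum((predict(x, frozen_w, head) - y) ** 2 for x, y in data) / len(data)


def fine_tune(frozen_w, head, data, lr=0.05, epochs=200):
    # Gradient descent on the output layer only, using the small new dataset.
    head = list(head)
    for _ in range(epochs):
        for x, y in data:
            feats = features(x, frozen_w)
            err = sum(h * f for h, f in zip(head, feats)) - y
            head = [h - lr * err * f for h, f in zip(head, feats)]
    return head


# Assumed starting point: a model that was "pretrained" elsewhere.
pretrained_w = [1.0, -1.0]
pretrained_head = [1.0, 1.0]

# Small, application-specific dataset (made-up values).
task_data = [(1.0, 2.0), (2.0, 4.0), (-1.0, 0.5), (-2.0, 1.0)]

loss_before = mean_squared_error(pretrained_w, pretrained_head, task_data)
tuned_head = fine_tune(pretrained_w, pretrained_head, task_data)
loss_after = mean_squared_error(pretrained_w, tuned_head, task_data)
```

The error on the new task falls after fine-tuning even though the original training data never appears, which is the point made by those who argue that a released model can be modified without it.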
Meta’s Llama 2 is a case in point, says Roman Shaposhnik, CEO of open-source AI company Ainekko and vice president of legal affairs for the Apache Software Foundation, who is involved in the OSI process. While Meta only released a pretrained model, a flourishing community of developers has been downloading and adapting it, and sharing their modifications.
“People are using it in all sorts of projects. There’s a whole ecosystem around it,” he says. “We therefore must call it something. Is it half-open? Is it ajar?”
While it may be technically possible to modify a model without its original training data, restricting access to a key ingredient is not really in the spirit of open source, says Zuzanna Warso, director of research at nonprofit Open Future, who is taking part in the OSI’s discussions. It’s also debatable whether it’s possible to truly exercise the freedom to study a model without knowing what information it was trained on.
“It’s a crucial component of this whole process,” she says. “If we care about openness, we should also care about the openness of the data.”