Large Language Models News & Discussions

bnew · Dec 18, 2023

Data poisoning: how artists are sabotaging AI to take revenge on image generators

As AI developers indiscriminately suck up online content to train their models, artists are seeking ways to fight back.

theconversation.com

T.J. Thomson, Author provided

Data poisoning: how artists are sabotaging AI to take revenge on image generators

Published: December 17, 2023 2:17pm EST

Authors
T.J. Thomson
Senior Lecturer in Visual Communication & Digital Media, RMIT University

Daniel Angus
Professor of Digital Communication, Queensland University of Technology
Disclosure statement

Imagine this. You need an image of a balloon for a work presentation and turn to a text-to-image generator, like Midjourney or DALL-E, to create a suitable image.

You enter the prompt: “red balloon against a blue sky” but the generator returns an image of an egg instead. You try again but this time, the generator shows an image of a watermelon.

What’s going on?

The generator you’re using may have been “poisoned”.

We believe good journalism is good for democracy and necessary for it.

Learn more

What is ‘data poisoning’?

Text-to-image generators work by being trained on large datasets that include millions or billions of images. Some generators, like those offered by Adobe or Getty, are only trained with images the generator’s maker owns or has a licence to use.

But other generators have been trained by indiscriminately scraping online images, many of which may be under copyright. This has led to a slew of copyright infringement cases where artists have accused big tech companies of stealing and profiting from their work.

This is also where the idea of “poison” comes in. Researchers who want to empower individual artists have recently created a tool named “Nightshade” to fight back against unauthorised image scraping.

The tool works by subtly altering an image’s pixels in a way that wreaks havoc to computer vision but leaves the image unaltered to a human’s eyes.

If an organisation then scrapes one of these images to train a future AI model, its data pool becomes “poisoned”. This can result in the algorithm mistakenly learning to classify an image as something a human would visually know to be untrue. As a result, the generator can start returning unpredictable and unintended results.

Symptoms of poisoning

As in our earlier example, a balloon might become an egg. A request for an image in the style of Monet might instead return an image in the style of Picasso.

Some of the issues with earlier AI models, such as trouble accurately rendering hands, for example, could return. The models could also introduce other odd and illogical features to images – think six-legged dogs or deformed couches.

The higher the number of “poisoned” images in the training data, the greater the disruption. Because of how generative AI works, the damage from “poisoned” images also affects related prompt keywords.

Read more: Do AI systems really have their own secret language?

For example, if a “poisoned” image of a Ferrari is used in training data, prompt results for other car brands and for other related terms, such as vehicle and automobile, can also be affected.

Nightshade’s developer hopes the tool will make big tech companies more respectful of copyright, but it’s also possible users could abuse the tool and intentionally upload “poisoned” images to generators to try and disrupt their services.

Is there an antidote?

In response, stakeholders have proposed a range of technological and human solutions. The most obvious is paying greater attention to where input data are coming from and how they can be used. Doing so would result in less indiscriminate data harvesting.

This approach does challenge a common belief among computer scientists: that data found online can be used for any purpose they see fit.

Other technological fixes also include the use of “ensemble modeling” where different models are trained on many different subsets of data and compared to locate specific outliers. This approach can be used not only for training but also to detect and discard suspected “poisoned” images.

Audits are another option. One audit approach involves developing a “test battery” – a small, highly curated, and well-labelled dataset – using “hold-out” data that are never used for training. This dataset can then be used to examine the model’s accuracy.

Strategies against technology

So-called “adversarial approaches” (those that degrade, deny, deceive, or manipulate AI systems), including data poisoning, are nothing new. They have also historically included using make-up and costumes to circumvent facial recognition systems.

Human rights activists, for example, have been concerned for some time about the indiscriminate use of machine vision in wider society. This concern is particularly acute concerning facial recognition.

Systems like Clearview AI, which hosts a massive searchable database of faces scraped from the internet, are used by law enforcement and government agencies worldwide. In 2021, Australia’s government determined Clearview AI breached the privacy of Australians.

Read more: Australian police are using the Clearview AI facial recognition system with no accountability

In response to facial recognition systems being used to profile specific individuals, including legitimate protesters, artists devised adversarial make-up patterns of jagged lines and asymmetric curves that prevent surveillance systems from accurately identifying them.

There is a clear connection between these cases and the issue of data poisoning, as both relate to larger questions around technological governance.

Many technology vendors will consider data poisoning a pesky issue to be fixed with technological solutions. However, it may be better to see data poisoning as an innovative solution to an intrusion on the fundamental moral rights of artists and users.

Artificial intelligence (AI)

DALL-E 2

MidJourney

Generative AI

bnew · Dec 18, 2023

Nvidia Staffers Warned CEO of Threat AI Would Pose to Minorities

As the chipmaker’s AI technology has become ubiquitous, it’s working to make it more inclusive

Nvidia Chief Executive Officer Jensen Huang met with employees in 2020 over risks posed by artificial intelligence.

Photographer: I-Hwa Cheng/Bloomberg

By Sinduja Rangarajan and Ian King

December 18, 2023 at 6:00 AM EST

Masheika Allgood and Alexander Tsado left their 2020 meeting with Nvidia Corp. Chief Executive Officer Jensen Huang feeling frustrated.

The pair, both former presidents of the company’s Black employees group, had spent a year working with colleagues from across the company on a presentation meant to warn Huang of the potential dangers that artificial intelligence technology posed, especially to minorities.

The 22-slide deck and other documents, reviewed by Bloomberg News, pointed to Nvidia’s growing role in shaping the future of AI — saying its chips were making AI ubiquitous — and warned that increased regulatory scrutiny was inevitable. The discussion included instances of bias in facial-recognition technologies used by the industry to power self-driving cars. Their aim, the pair told Bloomberg, was to find a way to confront the potentially perilous unintended consequences of AI head-on — ramifications that would likely be first felt by marginalized communities.

According to Allgood and Tsado, Huang did most of the talking during the meeting. They didn’t feel he really listened to them and, more importantly, didn’t get a sense that Nvidia would prioritize work on addressing potential bias in AI technology that could put underrepresented groups at risk.

Tsado, who was working as a product marketing manager, told Bloomberg News that he wanted Huang to understand that the issue needed to be tackled immediately — that the CEO might have the luxury of waiting, but “I am a member of the underserved communities, and so there’s nothing more important to me than this. We’re building these tools and I’m looking at them and I’m thinking, this is not going to work for me because I’m Black.’’

Masheika Allgood and Alexander Tsado.Photographer: David Odisho/Bloomberg

Both Allgood and Tsado quit the company shortly afterwards. Allgood’s decision to leave her role as a software product manager, she said, was because Nvidia “wasn’t willing to lead in an area that was very important to me.” In a LinkedIn post, she called the meeting “the single most devastating 45 minutes of my professional life.”

While Allgood and Tsado have departed, the concerns they raised about making AI safe and inclusive still hang over the company, and the AI industry at large. The chipmaker has one of the poorest records among big tech companies when it comes to Black and Hispanic representation in its workforce, and one of its generative AI products came under criticism for its failure to account for people of color.

The matters raised by Allgood and Tsado, meantime, also have resonated. Though Nvidia declined to comment on the specifics of the meeting, the company said it “continues to devote tremendous resources to ensuring that AI benefits everyone.”

“Achieving safe and trustworthy AI is a goal we’re working towards with the community,” Nvidia said in a statement. “That will be a long journey involving many discussions.”

One topic of the meeting isn’t in dispute. Nvidia has become absolutely central to the explosion in deployment of artificial intelligence systems. Sales of its chips, computers and associated software have taken off, sending its shares on an unprecedented rally. It’s now the world’s only chipmaker with a trillion-dollar market value.

What was once a niche form of computing is making its way into everyday life in the form of advanced chatbots, self-driving cars and image recognition. And AI models — which analyze existing troves of data to make predictions aimed at replicating human intelligence — are under development to be used in everything from drug discovery and industrial design to the advertising, military and security industries. With that proliferation, the concern about the risks it poses has only grown. Models are usually trained on massive datasets created by gathering information and visuals from across the internet.

As AI evolves into a technology that encroaches deeper into daily life, some Silicon Valley workers aren’t embracing it with the same level of trust that they’ve shown with other advances. Huang and his peers are likely to keep facing calls from workers who feel they need to be heard.

And while Silicon Valley figures such as Elon Musk have expressed fears about AI’s potential threat to human existence, some underrepresented minorities say they have a far more immediate set of problems. Without being involved in the creation of the software and services, they worry that self-driving cars might not stop for them, or that security cameras will misidentify them.

“The whole point of bringing diversity into the workplace is that we are supposed to bring our voices and help companies build tools that are better suited for all communities,’’ said Allgood. During the meeting, Allgood said she raised concerns that biased facial-recognition technologies used to power self-driving cars could pose greater threats to minorities. Huang replied that the company would limit risk by testing vehicles on the highway, rather than city streets, she said.

Alexander Tsado.Photographer: David Odisho/Bloomberg

The lack of diversity and its potential impact is particularly relevant at Nvidia. Only one out of a sample of 88 S&P 100 companies ranked lower than Nvidia based on their percentages of Black and Hispanic employees in 2021, according to data compiled by Bloomberg from the US Equal Employment Opportunity Commission. Of the five lowest-ranked companies for Black employees, four are chipmakers: Advanced Micro Devices Inc., Broadcom Inc., Qualcomm Inc. and Nvidia. Even by tech standards — the industry has long been criticized for its lack of diversity — the numbers are low.

Read More: Corporate America Promised to Hire a Lot More People of Color. It Actually Did.

During the meeting, Allgood recalled Huang saying that the diversity of the company would ensure that its AI products were ethical. At that time, only 1% of Nvidia employees were Black — a number that hadn’t changed from 2016 until then, according to data compiled by Bloomberg. That compared with 5% at both Intel Corp. and Microsoft Corp., 4% at Meta Platforms Inc. and 14% for the Black share of the US population overall in 2020, the data showed. People with knowledge of the meeting who asked not to be identified discussing its contents said Huang meant diversity of thought, rather than specifically race.

According to Nvidia, a lot has happened since Allgood and Tsado met with the CEO. The company says it has done substantial work to make its AI-related products fair and safe for everyone. AI models that it supplies to customers come with warning labels, and it vets the underlying datasets to remove bias. It also seeks to ensure that AI, once deployed, remains focused on its intended purpose.

In emails dated March 2020 reviewed by Bloomberg, Huang did give the go-ahead for work to start on some of Allgood’s proposals, but by that time she’d already handed in her notice.

Not long after Allgood and Tsado left Nvidia, the chipmaker hired Nikki Pope to lead its in-house Trustworthy AI project. Co-author of a book on wrongful convictions and incarcerations, Pope is head of what’s now called Nvidia’s AI & Legal Ethics program.

Rivals Alphabet Inc.’s Google and Microsoft had already set up similar AI ethics teams a few years earlier. Google publicly announced its “AI principles” in 2018 and has given updates on its progress. Microsoft had a team of 30 engineers, researchers and philosophers on its AI ethics team in 2020, some of whom it laid off this year.

Pope, who’s Black, said she doesn’t accept the assertion that minorities have to be involved directly to be able to produce unbiased models. Nvidia examines datasets that software is trained on, she said, and makes sure that they’re inclusive enough.

“I’m comfortable that the models that we provide for our customers to use and modify have been tested, that the groups who are going to be interacting with those models have been represented,” Pope said in an interview.

The company has created an open-source platform, called NeMo Guardrails, to help chatbots filter out unwanted content and stay on topic. Nvidia now releases “model cards” with its AI models, which provide more details on what a model does and how it’s made, as well as its intended use and limitations.

Nvidia also collaborates with internal affinity groups to diversify its datasets and test the models for biases before release. Pope said datasets for self-driving cars are now trained on images that include parents with strollers, people in wheelchairs and darker-skinned people.

bnew · Dec 18, 2023

Nvidia Ranks Close to the Bottom in Diverse Hiring

Company and most of its chipmaker peers lag rest of technology industry

Black, Hispanic and other races as a percentage of US workforce

Source: 2021 EEO-1 Filings compiled by Bloomberg

Note: Bloomberg is using “other races” to refer to employees who self-report as “Native Hawaiian or Other Pacific Islander,” “American Indian or Alaska Native,” or “two or more races.”

Pope and colleague Liz Archibald, who is director of corporate communications at Nvidia and also Black, said that they once had a “tough meeting” with Huang over AI transparency and safety. But they felt like his questions brought more rigor to their work.

“I think his end goal was to pressure-test our arguments and probe the logic to help figure out how he could make it even better for the company as a whole,” Archibald said in an email.

Some researchers say that minorities are so underrepresented in tech, and particularly in AI, that without their input, algorithms are likely to have blind spots. A paper from New York University’s AI Now Institute has linked a lack of representation in the AI workforce to bias in models, calling it a “diversity disaster.”

In 2020, researchers from Duke University set out to create software that would convert blurry pictures into high-resolution images, using a large language model from Nvidia called StyleGAN, which was developed to produce fake but hyperreal-looking human faces and trained on a dataset of images from photo site Flickr. When users played around with the tool, they found it struggled with low-resolution photos of people of color — including former President Barack Obama and Congresswoman Alexandria Ocasio-Cortez — inadvertently generating images of faces with lighter skin tones and eye colors. The researchers later said the bias likely came out of Nvidia’s model and updated their software.

Nvidia mentions in its code archives that its version of the dataset was collected from Flickr and inherits “all the biases of that website.” In 2022, it added that the dataset should not be used for “development or improvement of facial recognition technologies.”

The model that was criticized has been superseded by a new one, according to Pope.

Nvidia joins a list of large companies where some minority employees have expressed concern that the new technology carries dangers, particularly for people of color. Timnit Gebru, an AI ethics researcher, left Google after the company wanted her to retract her paper that warned of the dangers of training AI models (Gebru said Google fired her; the company said she resigned). She has said that any methodology that uses datasets “too large to document were inherently risky,” as reported by the MIT Technology Review.

Gebru and Joy Buolamwini, founder of the Algorithmic Justice League, published a paper called “Gender Shades” that showed how facial recognition technologies make errors at higher rates when identifying women and people of color. A growing number of studies now support their research that underlying datasets used to power AI models are biased and are capable of harming minorities. International Business Machines Corp, Microsoft and Amazon.com Inc. have stopped selling facial recognition technologies to police departments.

Read More: Humans Are Biased. Generative AI Is Even Worse

“If you look within the history of the tech industry, it’s not a beacon for being reflective of serious commitment to diversity,” said Sarah Myers West, the managing director of AI Now Institute and a co-author of the paper on lack of diversity in the AI workforce. The industry has a long history of not taking minorities and their concerns seriously, she said.

Nvidia’s head of human resources, Shelly Cerio, told Bloomberg that while the company was functioning like a startup — and worrying about surviving — it hired primarily to meet its immediate skills needs: as many engineers with higher degrees as it could find. Now that it’s larger, Nvidia has made diversity in its recruitment more of a priority.

“Have we made progress? Yes,” she said. “Have we made enough progress? Absolutely not.”

Masheika Allgood.Photographer: David Odisho/Bloomberg

The company improved its hiring of Black employees after 2020. Black representation grew from 1.1% in 2020 to 2.5% in 2021, the most recent year that data is available. Asians are the largest ethnic group at the company, followed by White employees.

Pope said all of the company’s efforts don’t “guarantee or eliminate” bias, but do provide a diversified dataset that can help address them. She said that in a fast-paced company that has released hundreds of models, scaling up her processes to address safety is one of the challenges of her role.

It also will take years to tell whether this work will be enough to keep AI systems safe in the real world. Self-driving cars, for example, are still rare.

A few weeks before Allgood left the company, she wrote one last email to Huang reflecting on when she had worked as a teacher in her previous career. She wrote that when she took her students on field trips, she relied on parents and volunteers to help her manage them — an acknowledgement that no one, no matter how brilliant, could handle a group of kids in the wild.

“AI has permanently moved into the field trip stage,” read the email. “You need colleagues and a structure to manage the chaos.”

— With assistance from Jeff Green

bnew · Dec 18, 2023

OpenAI Says Board Can Overrule CEO on Safety of New AI Releases

The arrangement was mentioned in a set of guidelines released Monday explaining how the ChatGPT-maker plans to deal with AI risks.

www.bloomberg.com

OpenAI Says Board Can Overrule CEO on Safety of New AI Releases

The arrangement was mentioned in a set of guidelines released Monday explaining how the ChatGPT-maker plans to deal with AI risks.

Sam Altman, chief executive officer of OpenAI, at the Hope Global Forums annual meeting in Atlanta, Georgia, US, on Monday, Dec. 11, 2023.
Photographer: Dustin Chambers/Bloomberg

By Rachel Metz
December 18, 2023 at 1:03 PM EST

OpenAI said its board can choose to hold back the release of an AI model even if the company’s leadership has deemed it safe, another sign of the artificial intelligence startup empowering its directors to bolster safeguards for developing the cutting-edge technology.

The arrangement was spelled out in a set of guidelines released Monday explaining how the ChatGPT-maker plans to deal with what it may deem to be extreme risks from its most powerful AI systems. The release of the guidelines follows a period of turmoil at OpenAI after Chief Executive Officer Sam Altman was briefly ousted by the board, putting a spotlight on the balance of power between directors and the company’s c-suite.

OpenAI’s recently announced “preparedness” team said it will continuously evaluate its AI systems to figure out how they fare across four different categories — including potential cybersecurity issues as well as chemical, nuclear and biological threats — and work to lessen any hazards the technology appears to pose. Specifically, the company is monitoring for what it calls “catastrophic” risks, which it defines in the guidelines as “any risk which could result in hundreds of billions of dollars in economic damage or lead to the severe harm or death of many individuals.”

Aleksander Madry, who is leading the preparedness group and is on leave from a faculty position at the Massachusetts Institute of Technology, told Bloomberg News his team will send a monthly report to a new internal safety advisory group. That group will then analyze Madry’s team’s work and send recommendations to Altman and the company’s board, which was overhauled after ousting the CEO. Altman and his leadership team can make a decision about whether to release a new AI system based on these reports, but the board has the right to reverse that decision, according to the document.

OpenAI announced the formation of the “preparedness” team in October, making it one of three separate groups overseeing AI safety at the startup. There’s also “safety systems,” which looks at current products such as GPT-4, and “superalignment,” which focuses on extremely powerful — and hypothetical — AI systems that may exist in the future.

Madry said his team will repeatedly evaluate OpenAI’s most advanced, unreleased AI models, rating them “low,” “medium,” “high,” or “critical” for different types of perceived risks. The team will also make changes in hopes of reducing potential dangers they spot in AI and measure their effectiveness. OpenAI will only roll out models that are rated “medium” or “low,” according to the new guidelines.

“AI is not something that just happens to us that might be good or bad,” Madry said. “It’s something we’re shaping.”

Madry said he hopes other companies will use OpenAI’s guidelines to evaluate potential risks from their AI models as well. The guidelines, he said, are a formalization of many processes OpenAI followed previously when evaluating AI technology it has already released. He and his team came up with the details over the past couple months, he said, and got feedback from others within OpenAI.

bnew · Dec 18, 2023

https://archive.is/OVtn3

https://archive.is/eRbjU

bnew · Dec 19, 2023

https://platform.openai.com/docs/guides/prompt-engineering

Prompt engineering

This guide shares strategies and tactics for getting better results from large language models (sometimes referred to as GPT models) like GPT-4. The methods described here can sometimes be deployed in combination for greater effect. We encourage experimentation to find the methods that work best for you.

Some of the examples demonstrated here currently work only with our most capable model,

Code:

gpt-4

. In general, if you find that a model fails at a task and a more capable model is available, it's often worth trying again with the more capable model.

You can also explore example prompts which showcase what our models are capable of:

https://platform.openai.com/examples

Prompt examples

Explore prompt examples to learn what GPT models can do

Six strategies for getting better results

Write clear instructions

These models can’t read your mind. If outputs are too long, ask for brief replies. If outputs are too simple, ask for expert-level writing. If you dislike the format, demonstrate the format you’d like to see. The less the model has to guess at what you want, the more likely you’ll get it.

Tactics:

Include details in your query to get more relevant answers

Ask the model to adopt a persona

Use delimiters to clearly indicate distinct parts of the input

Specify the steps required to complete a task

Provide examples

Specify the desired length of the output

Provide reference text

Language models can confidently invent fake answers, especially when asked about esoteric topics or for citations and URLs. In the same way that a sheet of notes can help a student do better on a test, providing reference text to these models can help in answering with fewer fabrications.

Tactics:

Instruct the model to answer using a reference text

Instruct the model to answer with citations from a reference text

Split complex tasks into simpler subtasks

Just as it is good practice in software engineering to decompose a complex system into a set of modular components, the same is true of tasks submitted to a language model. Complex tasks tend to have higher error rates than simpler tasks. Furthermore, complex tasks can often be re-defined as a workflow of simpler tasks in which the outputs of earlier tasks are used to construct the inputs to later tasks.

Tactics:

Use intent classification to identify the most relevant instructions for a user query

For dialogue applications that require very long conversations, summarize or filter previous dialogue

Summarize long documents piecewise and construct a full summary recursively

Give the model time to "think"

If asked to multiply 17 by 28, you might not know it instantly, but can still work it out with time. Similarly, models make more reasoning errors when trying to answer right away, rather than taking time to work out an answer. Asking for a "chain of thought" before an answer can help the model reason its way toward correct answers more reliably.

Tactics:

Instruct the model to work out its own solution before rushing to a conclusion

Use inner monologue or a sequence of queries to hide the model's reasoning process

Ask the model if it missed anything on previous passes

Use external tools

Compensate for the weaknesses of the model by feeding it the outputs of other tools. For example, a text retrieval system (sometimes called RAG or retrieval augmented generation) can tell the model about relevant documents. A code execution engine like OpenAI's Code Interpreter can help the model do math and run code. If a task can be done more reliably or efficiently by a tool rather than by a language model, offload it to get the best of both.

Tactics:

Use embeddings-based search to implement efficient knowledge retrieval

Use code execution to perform more accurate calculations or call external APIs

Give the model access to specific functions

Test changes systematically

Improving performance is easier if you can measure it. In some cases a modification to a prompt will achieve better performance on a few isolated examples but lead to worse overall performance on a more representative set of examples. Therefore to be sure that a change is net positive to performance it may be necessary to define a comprehensive test suite (also known an as an "eval").

Tactic:

Evaluate model outputs with reference to gold-standard answers

Tactics

Each of the strategies listed above can be instantiated with specific tactics. These tactics are meant to provide ideas for things to try. They are by no means fully comprehensive, and you should feel free to try creative ideas not represented here.

Strategy: Write clear instructions

Tactic: Include details in your query to get more relevant answers

In order to get a highly relevant response, make sure that requests provide any important details or context. Otherwise you are leaving it up to the model to guess what you mean.

CONTINUE READING ON SITE....

bnew · Dec 20, 2023

https://www.reuters.com/technology/ai-cannot-be-patent-inventor-uk-supreme-court-rules-landmark-case-2023-12-20/

AI cannot be patent 'inventor', UK Supreme Court rules in landmark case

Reuters

December 20, 20239:31 AM ESTUpdated an hour ago

Illustration shows miniature of robot and toy hand

Words reading "Artificial intelligence AI", miniature of robot and toy hand are pictured in this illustration taken December 14, 2023. REUTERS/Dado Ruvic/Illustration Acquire Licensing Rights

LONDON, Dec 20 (Reuters) - A U.S. computer scientist on Wednesday lost his bid to register patents over inventions created by his artificial intelligence system in a landmark case in Britain about whether AI can own patent rights.

Stephen Thaler wanted to be granted two patents in the UK for inventions he says were devised by his "creativity machine" called DABUS.

His attempt to register the patents was refused by the UK's Intellectual Property Office (IPO) on the grounds that the inventor must be a human or a company, rather than a machine.

Thaler appealed to the UK's Supreme Court, which on Wednesday unanimously rejected his appeal as under UK patent law "an inventor must be a natural person".

Judge David Kitchin said in the court's written ruling that the case was "not concerned with the broader question whether technical advances generated by machines acting autonomously and powered by AI should be patentable".

Thaler's lawyers said in a statement that the ruling "establishes that UK patent law is currently wholly unsuitable for protecting inventions generated autonomously by AI machines and as a consequence wholly inadequate in supporting any industry that relies on AI in the development of new technologies".

'LEGITIMATE QUESTIONS'

A spokesperson for the IPO welcomed the decision "and the clarification it gives as to the law as it stands in relation to the patenting of creations of artificial intelligence machines".

They added that there are "legitimate questions as to how the patent system and indeed intellectual property more broadly should handle such creations" and the government will keep this area of law under review.

Thaler earlier this year lost a similar bid in the United States, where the Supreme Court declined to hear a challenge to the U.S. Patent and Trademark Office's refusal to issue patents for inventions created by his AI system.

Giles Parsons, a partner at law firm Browne Jacobson, who was not involved in the case, said the UK Supreme Court's ruling was unsurprising.

"This decision will not, at the moment, have a significant effect on the patent system," he said. "That's because, for the time being, AI is a tool, not an agent.

"I do expect that will change in the medium term, but we can deal with that problem as it arises."

Rajvinder Jagdev, an intellectual property partner at Powell Gilbert, said the ruling followed similar decisions by courts in Europe, Australia and the U.S. and has "given certainty that inventors must be a natural person."

But he added: "The judgment does not preclude a person using an AI to devise an invention – in such a scenario, it would be possible to apply for a patent provided that person is identified as the inventor."

In a separate case last month, London's High Court ruled that artificial neural networks can attract patent protection under UK law.

Reporting by Sam Tobin; editing by Kylie MacLellan, Jason Neely and Louise Heavens

bnew · Dec 20, 2023

Bloomberg - Are you a robot?

Large AI Dataset Has Over 1,000 Child Abuse Images, Researchers Find

The dataset has been used to build popular AI image generators, including Stable Diffusion.

By Davey Alba and Rachel Metz

December 20, 2023 at 7:00 AM EST

A massive public dataset used to build popular artificial intelligence image generators contains at least 1,008 instances of child sexual abuse material, a new report from the Stanford Internet Observatory found.

LAION-5B, which contains more than 5 billion images and related captions from the internet, may also include thousands of additional pieces of suspected child sexual abuse material, or CSAM, according to the report. The inclusion of CSAM in the dataset could enable AI products built on this data — including image generation tools like Stable Diffusion — to create new, and potentially realistic, child abuse content, the report warned.

The rise of increasingly powerful AI tools has raised alarms in part because these services are built with troves of online data — including public datasets such as LAION-5B — that can contain copyrighted or harmful content. AI image generators, in particular, rely on datasets that include pairs of images and text descriptions to determine a wide range of concepts and create pictures in response to prompts from users.

In a statement, a spokesperson for LAION, the Germany-based nonprofit behind the dataset, said the group has a “zero tolerance policy” for illegal content and was temporarily removing LAION datasets from the internet “to ensure they are safe before republishing them.” Prior to releasing its datasets, LAION created and published filters for spotting and removing illegal content from them, the spokesperson said.

Christoph Schuhmann, LAION’s founder, previously told Bloomberg News that he was unaware of any child nudity in the dataset, though he acknowledged he did not review the data in great depth. If notified about such content, he said, he would remove links to it immediately.

A spokesperson for Stability AI, the British AI startup that funded and popularized Stable Diffusion, said the company is committed to preventing the misuse of AI and prohibits the use of its image models for unlawful activity, including attempts to edit or create CSAM. “This report focuses on the LAION-5B dataset as a whole,” the spokesperson said in a statement. “Stability AI models were trained on a filtered subset of that dataset. In addition, we fine-tuned these models to mitigate residual behaviors.”

LAION-5B, or subsets of it, have been used to build multiple versions of Stable Diffusion. A more recent version of the software, Stable Diffusion 2.0, was trained on data that substantially filtered out “unsafe” materials in the dataset, making it much more difficult for users to generate explicit images. But Stable Diffusion 1.5 does generate sexually explicit content and is still in use in some corners of the internet. The spokesperson said Stable Diffusion 1.5 was not released by Stability AI, but by Runway, an AI video startup that helped create the original version of Stable Diffusion. Runway said it was released in collaboration with Stability AI.

“We have implemented filters to intercept unsafe prompts or unsafe outputs when users interact with models on our platform,” the Stability AI spokesperson added. “We have also invested in content labeling features to help identify images generated on our platform. These layers of mitigation make it harder for bad actors to misuse AI.”

LAION-5B was released in 2022 and relies on raw HTML code collected by a California nonprofit to locate images around the web and associate them with descriptive text. For months, rumors that the dataset contained illegal images have circulated in discussion forums and on social media.

“As far as we know, this is the first attempt to actually quantify and validate concerns,” David Thiel, chief technologist of the Stanford Internet Observatory, said in an interview with Bloomberg News.

For their report, Stanford Internet Observatory researchers detected the CSAM material by looking for different kinds of hashes, or digital fingerprints, of such images. The researchers then validated them using APIs dedicated to finding and removing known images of child exploitation, as well as by searching for similar images in the dataset.

Much of the suspected CSAM content that the Stanford Internet Observatory found was validated by third parties like Canadian Centre for Child Protection and through a tool called PhotoDNA, developed by Microsoft Corp., according to the report. Given that the Stanford Internet Observatory researchers could only work with a limited portion of high-risk content, additional abusive content likely exists in the dataset, the report said.

While the amount of CSAM present in the dataset doesn’t indicate that the illicit material “drastically” influences the images churned out by AI tools, Thiel said it does likely still have an impact. “These models are really good at being able to learn concepts from a small number of images,” he said. “And we know that some of these images are repeated, potentially dozens of times in the dataset.”

Stanford Internet Observatory’s work previously found that generative AI image models can produce CSAM, but that work assumed the AI systems were able to do so by combining two “concepts,” such as children and sexual activity. Thiel said the new research suggests these models might generate such illicit images because of some of the underlying data on which they were built. The report recommends that models based on Stable Diffusion 1.5 “should be deprecated and distribution ceased wherever feasible.”

— With assistance from Marissa Newman and Aggi Cantrill[/SIZE]

bnew · Dec 20, 2023

Largest Dataset Powering AI Images Removed After Discovery of Child Sexual Abuse Material

The model is a massive part of the AI-ecosystem, used by Google and Stable Diffusion. The removal follows discoveries made by Stanford researchers, who found thousands instances of suspected child sexual abuse material in the dataset.

www.404media.co

Largest Dataset Powering AI Images Removed After Discovery of Child Sexual Abuse Material

SAMANTHA COLE

·DEC 20, 2023 AT 7:00 AM

The model is a massive part of the AI-ecosystem, used by Google and Stable Diffusion. The removal follows discoveries made by Stanford researchers, who found thousands instances of suspected child sexual abuse material in the dataset.[/SIZE]

COLLAGE BY 404 MEDIA / IMAGES VIA PEXELS

Become a paid subscriber for unlimited, ad-free articles and access to bonus content. This site is funded by subscribers and you will be directly powering our journalism.

This piece is published with support from The Capitol Forum.

The LAION-5B machine learning dataset used by Google, Stable Diffusion, and other major AI products has been removed by the organization that created it after a Stanford study found that it contained 3,226 suspected instances of child sexual abuse material, 1,008 of which were externally validated.

LAION told 404 Media on Tuesday that out of “an abundance of caution,” it was taking down its datasets temporarily “to ensure they are safe before republishing them."

According to a new study by the Stanford Internet Observatory shared with 404 Media ahead of publication, the researchers found the suspected instances of CSAM through a combination of perceptual and cryptographic hash-based detection and analysis of the images themselves.

“We find that having possession of a LAION‐5B dataset populated even in late 2023 implies the possession of thousands of illegal images—not including all of the intimate imagery published and gathered non‐consensually, the legality of which is more variable by jurisdiction,” the paper says. “While the amount of CSAM present does not necessarily indicate that the presence of CSAM drastically influences the output of the model above and beyond the model’s ability to combine the concepts of sexual activity and children, it likely does still exert influence. The presence of repeated identical instances of CSAM is also problematic, particularly due to its reinforcement of images of specific victims.”

The finding highlights the danger of largely indiscriminate scraping of the internet for the purposes of generative artificial intelligence.

Large-scale Artificial Intelligence Open Network, or LAION, is a non-profit organization that creates open-source tools for machine learning. LAION-5B is one of its biggest and most popular products. It is made up of more than five billion links to images scraped from the open web, including user-generated social media platforms, and is used to train the most popular AI generation models currently on the market. Stable Diffusion, for example, uses LAION-5B, and Stability AI funded its development.

“If you have downloaded that full dataset for whatever purpose, for training a model for research purposes, then yes, you absolutely have CSAM, unless you took some extraordinary measures to stop it,” David Thiel, lead author of the study and Chief Technologist at the Stanford Internet Observatory told 404 Media.

Public chats from LAION leadership in the organization’s official Discord server show that they were aware of the possibility of CSAM being scraped into their datasets as far back as 2021.

“I guess distributing a link to an image such as child porn can be deemed illegal,” LAION lead engineer Richard Vencu wrote in response to a researcher asking how LAION handles potential illegal data that might be included in the dataset. “We tried to eliminate such things but there’s no guarantee all of them are out.”

SCREENSHOT VIA THE LAION DISCORD

Most institutions in the US, including Thiel’s team, aren’t legally allowed to view CSAM in order to verify it themselves. To do CSAM research, experts often rely on perceptual hashing, which extracts a unique digital signature, or fingerprint, from an image or video. PhotoDNA is a technology that creates unique hashes for images of child exploitation in order to find those images elsewhere on the web and get them removed or pursue abusers or proliferators.

“With the goal of quantifying the degree to which CSAM is present in the training dataset as well as eliminating it from both LAION‐5B and derivative datasets, we use various complementary techniques to identify potential CSAM in the dataset: perceptual hash‐based detection, cryptographic hash‐based detection, and nearest‐neighbors analysis leveraging the image embeddings in the dataset itself,” the paper says. Through this process, they identified at least 2,000 dataset entries of suspected CSAM, and confirmed those entries with third parties.

To do their research, Thiel said that he focused on URLs identified by LAION’s safety classifier as “not safe for work” and sent those URLs to PhotoDNA. Hash matches indicate definite, known CSAM, and were sent to the Project Arachnid Shield API and validated by Canadian Centre for Child Protection, which is able to view, verify, and report those images to the authorities. Once those images were verified, they could also find “nearest neighbor” matches within the dataset, where related images of victims were clustered together.

LAION could have used a method similar to this before releasing the world’s largest AI training dataset, Thiel said, but it didn’t. “[LAION] did initially use CLIP to try and filter some things out, but it does not appear that they did that in consultation with any child safety experts originally. It was good that they tried. But the mechanisms they used were just not super impressive,” Thiel said. “They made an attempt that was not nearly enough, and it is not how I would have done it if I were trying to design a safe system.”

A spokesperson for LAION told 404 Media in a statement about the Stanford paper:

"LAION is a non-profit organization that provides datasets, tools and models for the advancement of machine learning research. We are committed to open public education and the environmentally safe use of resources through the reuse of existing datasets and models. LAION datasets (more than 5.85 billion entries) are sourced from the freely available Common Crawl web index and offer only links to content on the public web, with no images. We developed and published our own rigorous filters to detect and remove illegal content from LAION datasets before releasing them. We collaborate with universities, researchers and NGOs to improve these filters and are currently working with the Internet Watch Foundation (IWF) to identify and remove content suspected of violating laws. We invite Stanford researchers to join LAION to improve our datasets and to develop efficient filters for detecting harmful content. LAION has a zero tolerance policy for illegal content and in an abundance of caution, we are temporarily taking down the LAION datasets to ensure they are safe before republishing them."

This study follows a June paper by Stanford that examined the landscape of visual generative models that could be used to create CSAM. Thiel told me he continued to pursue the topic after a tip from AI researcher Alex Champandard, who found a URL of an image in LAION-5B on Hugging Face that was captioned with a phrase in Spanish that appeared to describe child exploitation material. LAION-5B is available for download from Hugging Face as an open-source tool.

Champandard told me he noticed a report to Hugging Face on LAION-5B in August 2022, flagging “an example that describes something related to pedophilia.” One of the engineers who worked on LAION-5B responded in March 2023, saying the link was dead but they’d removed it anyway because the caption was inappropriate.

“It took 7 months for that report to get dealt with by Hugging Face or LAION — which I found to be highly questionable,” Champandard said.

Following Champandard’s tweets, Hugging Face’s chief ethics scientist Margaret Mitchell wrote on Mastodon: “I just wanted to pop in to say that there has been a lot of time and energy spent on trying to find CSAM, and none has been found. Some people at HF are being attacked as if pedophiles but it's just...inappropriate cruelty.”

I asked Hugging Face whether, in light of this study and before LAION removed the datasets themselves, it would take action against datasts that were found to have links to CSAM. A spokesperson for the company said, "Yes."

"Datasets cannot be seen by Hugging Face staff (nor anyone accessing the Hub) until they are uploaded, and the uploader can decide to make the content public. Once shared, the platform runs content scanning to identify potential issues. Users are responsible for uploading and maintaining content, and staff addresses issues following the Hugging Face platform’s content guidelines, which we continue to adapt. The platform relies on a combination of technical content analysis to validate that the guidelines are indeed followed, community moderation, and reporting features to allow users to raise concerns. We monitor reports and take actions when infringing content is flagged," the Hugging Face spokesperson said. "Critical to this discussion is noting that the LAION-5B dataset contains URLs to external content, not images, which poses additional challenges. We are working with civil society and industry partners to develop good practices to handle these kinds of cross-platform questions."

The Stanford paper says that the material detected during their process is “inherently a significant undercount due to the incompleteness of industry hash sets, attrition of live hosted content, lack of access to the original LAION reference image sets, and the limited accuracy of ‘unsafe’ content classifiers.”

bnew · Dec 20, 2023

HOW DID THIS HAPPEN?

Child abuse material likely got into LAION because the organization compiled the dataset using tools that scrape the web, and CSAM isn’t relegated to the realm of the “dark web,” but proliferates on the open web and on many mainstream platforms. In 2022, Facebook made more than 21 million reports of CSAM to the National Center for Missing and Exploited Children (NCMEC) tipline, while Instagram made 5 million reports, and Twitter made 98,050.

In the US, electronic service providers [ESP] are required by law to report “apparent child pornography” to NCMEC’s CyberTipline when they become aware of them, but “there are no legal requirements for proactive efforts to detect this content or what information an ESP must include in a CyberTipline report,” according to NCMEC. A dataset, however, is different from a website, even if it is composed of data from a huge number of websites.

“Because it's the internet, there are going to be datasets that have child porn. Twitter's got it. You know, Facebook has it. It's all sitting there. They don't do a good job of policing for it, even though they claim that they do. And that's now going to be used to train these models,” Marcus Rogers, Assistant Dean for Cybersecurity Initiatives at Purdue University, told 404 Media. Organizations building datasets, however, may be intentionally ignoring the possibility that CSAM could pollute their models, he said. “Companies just don't want to know. Some of it is just, even if they wanted to know they literally have lost control of everything.”

“I think the reason that they probably ignore it is because they don't have a solution,” Bryce Westlake, an associate professor in the Department of Justice Studies and a faculty member of the department's Forensic Science program, told 404 Media. “So they don't want to bring attention to it. Because if they bring attention to it, then something's going to have to be done about that.” The interventions dataset creators could make would be labor intensive, he said, and even with those efforts in place they might not rid the sets of all of it, he said. “It's impossible for them to get rid of all of it. The only answer that society will accept is that you have 0% in there, and it's impossible to do. They're in a no win situation, so they think it's better that people just don't know."

HOW CSAM IN DATASETS AFFECTS REAL PEOPLE

In a dataset of five billion entries, 3,226 might seem like a drop in an ocean of data. But there are several ways CSAM scraped into LAION’s datasets could make things worse for real-life victims.

Dan Sexton, chief technology officer at the UK-based Internet Watch Foundation, told me that the goal for internet safety groups is to prevent more people from viewing or spreading abusive content and to get it offline entirely. We spoke months before Stanford’s paper came out, when we didn’t know for sure that child abuse material was being scraped into large datasets. “[Victims] knowing that their content is in a dataset that's allowing a machine to create other images—which have learned from their abuse—that's not something I think anyone would have expected to happen, but it's clearly not a welcome development. For any child that's been abused and their imagery circulated, excluding it anywhere on the internet, including datasets, is massive,” he said. [/SIZE]

"There's no reason that images of children being sexually abused should ever be in those datasets"

Lloyd Richardson, director of information technology at the Canadian Centre for Child Protection (C3P), told me that he imagines past victims of child sexual abuse would be “absolutely disgusted, no doubt, but probably not necessarily surprised” to learn their images are linked in a dataset like LAION-5B. “They've known for so long that they've had to deal with their images or images and videos circulating on the internet. Some reasonable technical things that could be done for the last well-over a decade, they just haven't been done right,” he said.

“I don't think anyone wants to create a tool that creates images of children being sexually abused, even if it's accidental,” Sexton said. “AI is all about having good data, and if you put bad data in, you're going to get bad data out. Of course, this is bad data. You don't want to generate or scrape images of child sexual abuse.”

Until now, it’s been theorized that AI models that are capable of creating child sexual abuse imagery were combining concepts of explicit adult material and non-explicit images of children to create AI-generated CSAM. According to Stanford’s report, real abuse imagery is helping train models.

Artificially generated CSAM is on the rise, and it has the potential to jam up hotlines and deter resources from reporting agencies that work with law enforcement to find perpetrators and get it taken offline. The Internet Watch Foundation recently released a report saying that AI CSAM is “visually indistinguishable from real CSAM,” even to trained analysts. Earlier this month, a 404 Media investigation found people using popular image generation platform Civitai were creating what “could be considered child pornography.” And in May, the National Center for Missing and Exploited Children, a victim advocacy organization which runs a hotline for reporting CSAM, said it was preparing for a “flood” of artificially generated content.

Richardson told me that actual CSAM training models could mean more realistic abusive deepfakes of victims. “You could have an offender download Stable Diffusion, create a LoRA [Low-Rank Adaptation, a more narrowly-tuned deep learning model] for a specific victim, and start generating new imagery on this victim,” he said. Even if the victim’s abuse was long in the past and now they’re an adult, “now they're having new material created of them based on the existing CSAM that was out there,” he said. “So that's hugely problematic.”

“There's no reason that images of children being sexually abused should ever be in those datasets, both to be sure the models themselves don’t create undesirable results, but also for those victims to make sure their imagery is not continually and still being used for harmful purposes,” Sexton said.

a16z Funded AI Platform Generated Images That “Could Be Categorized as Child Pornography,” Leaked Documents Show

OctoML, the engine that powers a16z funded Civitai, thought the images could qualify as “child pornography,” but ultimately decided to keep working with the company anyway, internal Slack chats and other material shows.

EMANUEL MAIBERG

“Given what it's used to train, you can't argue that it's just like having a copy of the internet, so you're going to have some stuff on there that's bad or somehow illegal,” Thiel said. “You're operationalizing it by training the models on those things. And given that you have images that will repeat over and over in that dataset that makes the model more likely to not just represent the material, but you'd have the potential for resemblance to occur of actual people that were fed into the data set.”

bnew · Dec 20, 2023

WHO'S RESPONSIBLE?

Legally, there is no precedent yet for who’s responsible when a scraping tool gathers illegal imagery. As Vencu noted in his Discord message in 2021, LAION is disseminating links, not actual copies of images. “Since we are not distributing or deriving other images from originals, I do not think the image licensing apply,” he said in Discord to the question about whether illegal material was in the dataset.

Copyright infringement has been a major concern for artists and content creators whose imagery is being used to train AI models. In April, a German stock photographer asked LAION to exclude his photos from its datasets, and LAION responded by invoicing him for $979, claiming he filed an unjustified copyright claim. Earlier this year, a group of artists filed a class-action lawsuit against Stability AI, DeviantArt, and Midjourney for their use of image generator Stable Diffusion, which uses LAION’s datasets. And Getty Images recently sued Stability AI, claiming that the company copied more than 12 million images without permission.

“We have issues with those services, how they were built, what they were built upon, how they respect creator rights or not, and how they actually feed into deepfakes and other things like that,” Getty Images CEO Craig Peters told the Associated Press.

Spreading CSAM is a federal crime, and the US laws about it are extremely strict. It is of course illegal to possess or transmit files, but “undeveloped film, undeveloped videotape, and electronically stored data that can be converted into a visual image of child pornography” are also illegal under federal law. It’s not clear where URLs that link to child exploitation images would land under current laws, or at what point anyone using these datasets could potentially be in legal jeopardy.

Because anti CSAM laws are understandably so strict, researchers have had to figure out new ways of studying its spread without breaking the law themselves. Westlake told me he relies on outsourcing some research to colleagues in Canada, such as the C3P, to verify or clean data, where there are CSAM laws that carve out exceptions for research purposes. Stanford similarly sent its methodology to C3P for verification. The Internet Watch Foundation has a memorandum of understanding granted to them by the Crown Prosecution Service, the principal public criminal prosecuting agency in the UK, to download, view, and hold content for its duties, which enables it to proactively search for abusive content and report it to authorities. In the US, viewing, searching for, or possessing child exploitation material, even if accidentally, is a federal crime. [/SIZE]

“Places should no longer host those datasets for download."

Rogers and his colleague Kathryn Seigfried-Spellar at Purdue’s forensics department have a unique situation: They’re deputized, and have law enforcement status granted to them by local law enforcement to do their work. They have a physical space in a secure law enforcement facility, with surveillance cameras, key fobs, a secured network, and 12-factor identification where they must go if they want to do work like cleaning datasets or viewing CSAM for research or investigative purposes.

Even so, they’re incredibly careful about what they collect with scraping tools. Siegfried-Spellar told me she’s working on studying knuckles and hands because they often appear in abuse imagery and are as identifiable as faces, and could scrape images from NSFW Reddit forums where people post images of themselves masturbating, but she doesn’t because of the risk of catching underage imagery in the net.

“Even though you have to be over the age of 18 to use Reddit, I am never going to go and scrape that data and use it, or analyze it for my research, because I can't verify that somebody really is over the age of 18 that posted that,” she said. “There have been conversations about that as well: ‘there's pictures on the internet, why can’t I just scrape and use that for my algorithm training?’ But it's because I need to know the age of the sources.”

WHAT TO DO NOW

Because LAION-5B is open-source, lots of copies are floating around publicly, including on Hugging Face. Removing the dataset from Hugging Face, pulling CSAM links to abusive imagery out of the dataset, and then reuploading it, for example, would essentially create a roadmap for someone determined to view those files by comparing the differences between the two.

Thiel told me that he went into this study thinking the goal might be to get abusive material out of datasets, but now he believes it’s too late.

“Now I'm more of the opinion that [the LAION datasets] kind of just need to be scratched,” he said. “Places should no longer host those datasets for download. Maybe there's an argument for keeping copies of it for research capacity, and then you can go through and take some steps to clean it.”

There is a precedent for this, especially when it comes to children’s data. The Federal Trade Commission has a term for model deletion as damage control: algorithm disgorgement. As an enforcement strategy, the FTC has used algorithm disgorgement in five cases involving tech companions that built models on improperly-obtained data, including a settlement with Amazon in May over charges that Alexa voice recordings violated children’s privacy, and a settlement between the FTC and Department of Justice and a children’s weight loss app that allegedly failed to properly verify parental consent. Both cases invoked the Children’s Online Privacy Protection Act (COPPA).

Child safety and AI are quickly becoming the next major battleground of the internet. In April, Democratic senator dikk Durbin introduced the “STOP CSAM Act,” which would make it a crime for providers to “knowingly host or store” CSAM or “knowingly promote or facilitate” the sexual exploitation of children, create a new federal crime for online services that “knowingly promote or facilitate” child exploitation crimes, and amend Section 230—the law that shields platforms from liability for their users’ actions—to allow for civil lawsuits by victims of child exploitation crimes against online service providers. Privacy advocates including the Electronic Frontier Foundation and the Center for Democracy and Technology oppose the act, warning that it could undermine end-to-end encryption services. The inclusion of “apparent” CSAM widens the net too much, they say, and the terms “promote” and “facilitate” are overly broad. It could also have a chilling effect on free speech overall: “First Amendment-protected content involving sexuality, sexual orientation, or gender identity will likely be targets of frivolous takedown notices,” EFF attorneys and surveillance experts wrote in a blog post.

In September, attorneys general from 50 states called on federal lawmakers to study how AI-driven exploitation can endanger children. “We are engaged in a race against time to protect the children of our country from the dangers of AI,” the prosecutors wrote. “Indeed, the proverbial walls of the city have already been breached. Now is the time to act.”

Thiel said he hadn’t communicated with LAION before the study was released. “We're not intending this as some kind of gotcha for any of the parties involved. But obviously a lot of very important mistakes were made in various parts of this whole pipeline,” he said. “And it's really just not how model training in the future should work at all.”

All of this is a problem that’s not going away, even—or especially—if it’s ignored. “They all have massive problems associated with massive data theft, non consensual, intimate images, Child Sexual Abuse material, you name it, it's in there. I’m kind of perplexed at how it's gone on this long,” Richardson said. “It's not that the technology is necessarily bad... it's not that AI is bad. It's the fact that a bunch of things were blindly stolen, and now we're trying to put all these Band-aids to fix something that really never should have happened in the first place.”

Become a paid subscriber for unlimited, ad-free articles and access to bonus content. This site is funded by subscribers and you will be directly powering our journalism.

Update 12/20, 8:19 a.m. EST: This headline was edited to remove the word "suspected" because 1,008 entries were externally validated.

Update 12/20, 11:20 a.m. EST: This story was corrected to reflect Common Crawl's inability to crawl Twitter, Instagram and Facebook.

bnew · Dec 20, 2023

AI Robot Outmaneuvers Humans in Maze Run Breakthrough

Computers have famously beaten humans at poker, Go and chess. Now they can learn the physical skills to excel at basic games of dexterity.

www.bloomberg.com

AI Robot Outmaneuvers Humans in Maze Run Breakthrough

Robot learned in record time to guide a ball through a maze
The AI robot used two knobs to manipulate playing surface

The Labyrinth game.Source: ETH Zurich

By Saritha Rai

December 19, 2023 at 7:30 AM EST

Computers have famously beaten humans at poker, Go and chess. Now they can learn the physical skills to excel at basic games of dexterity.

Researchers at ETH Zurich have created an AI robot called CyberRunner they say surpassed humans at the popular game Labyrinth. It navigated a small metal ball through a maze by tilting its surface, avoiding holes across the board, mastering the toy in just six hours, they said.

CyberRunner marked one of the first instances in which an AI beat humans at direct physical applications, said Raffaello D’Andrea and Thomas Bi, researchers at the prominent European institution. In experiments, their robot used two knobs to manipulate the playing surface, requiring fine motor skills and spatial reasoning. The game itself required real-time strategic thinking, quick decisions and precise action.

The duo shared their work in an academic paper published on Tuesday. They built their model on recent advances in a field called model-based reinforcement learning, a type of machine learning where the AI learns how to behave in a dynamic environment by trial and error.

“We are putting our work on an open-source platform to show it’s possible, sharing the details of how it’s done, and making it inexpensive to continue the work,” said D’Andrea, who co-founded Kiva Systems before selling itto Amazon.com Inc. “There will be thousands of these AI systems soon doing collaborative experiments, communicating and sharing best practices.”

Raffaello D’AndreaSource: ETH Zurich

Industrial robots have performed repetitive, precise manufacturing tasks for decades, but adjustments on-the-fly such as the ones CyberRunner demonstrated are next-level, the researchers said. The system can think, learn and self-develop on physical tasks, previously thought achievable only through human intelligence.

CyberRunner learns through experience, through a camera looking down at the labyrinth. During the process, it discovered surprising ways to “cheat” by skipping parts of the maze. The researchers had to step in and explicitly instruct it not to take shortcuts.

The duo’s open-sourced project is now available on their website. For $200, it can help users coordinate large-scale experiments using the CyberRunner platform.

“This is not a bespoke platform that costs a lot of money,” D’Andrea said. “The exciting thing is that we are doing it on a platform that’s open to everyone, and costs almost nothing to further advance the work.”

bnew · Dec 20, 2023

Seeking a Big Edge in A.I., South Korean Firms Think Smaller

While they lag behind their U.S. counterparts, their focus on non-English languages could help loosen the American grip on artificial intelligence.

www.nytimes.com

The features of Exaone, LG’s generative A.I., were demonstrated at LG A.I. Research in Seoul in September.Credit...Tina Hsu for The New York Times

Seeking a Big Edge in A.I., South Korean Firms Think Smaller

While they lag behind their U.S. counterparts, their focus on non-English languages could help loosen the American grip on artificial intelligence.

The features of Exaone, LG’s generative A.I., were demonstrated at LG A.I. Research in Seoul in September.Credit...Tina Hsu for The New York Times

By John Yoon
Reporting from Seoul

Dec. 20, 2023Updated 3:32 a.m. ET

ChatGPT, Bard, Claude. The world’s most popular and successful chatbots are trained on data scraped from vast swaths of the internet, mirroring the cultural and linguistic dominance of the English language and Western perspectives. This has raised alarms about the lack of diversity in artificial intelligence. There is also the worry that the technology will remain the province of a handful of American companies.

In South Korea, a technological powerhouse, firms are taking advantage of the technology’s malleability to shape A.I. systems from the ground up to address local needs. Some have trained A.I. models with sets of data rich in Korean language and culture. South Korean companies say they’re building A.I. for Thai, Vietnamese and Malaysian audiences. Others are eyeing customers in Brazil, Saudi Arabia and the Philippines, and in industries like medicine and pharmacy.

This has fueled hopes that A.I. can become more diverse, work in more languages, be customized to more cultures and be developed by more countries.

“The more competition is out there, the more systems are going to be robust: socially acceptable, safer, more ethical,” said Byong-Tak Zhang, a computer science professor at Seoul National University.

While there are some prominent non-American A.I. companies, like France’s Mistral, the recent upheaval at OpenAI, the maker of ChatGPT, has highlighted how concentrated the industry remains.

The emerging A.I. landscape in South Korea is one of the most competitive and diverse in the world, said Yong Lim, a professor of law at Seoul National University who leads its AI Policy Initiative. The country’s export-driven economy has encouraged new ventures to seek ways to tailor A.I. systems to specific companies or countries.

South Korea is well positioned to build A.I. technology, developers say, given it has one of the world’s most wired populations to generate vast amounts of data to train A.I. systems. Its tech giants have the resources to invest heavily in research. The government has also been encouraging: It has provided companies with money and data that could be used to train large language models, the technology that powers A.I. chatbots.

Commuters entering the Gangnam metro underground train station in Seoul.Credit...Anthony Wallace/Agence France-Presse — Getty Images

CCTV cameras in the Gangnam district are equipped with artificial intelligence technology to monitor crowd density and detect early signs of crowd control disasters.Credit...Soo-Hyeon Kim/Reuters

Few other countries have the combination of capital and technology required to develop a large language model that can power a chatbot, experts say. They estimate that it costs $100 million to $200 million to build a foundational model, the technology that serves as the basis for A.I. chatbots.

South Korea is still months behind the United States in the A.I. race and may never fully catch up, as the leading chatbots continue to improve with more resources and data.

But South Korean companies believe they can compete. Instead of going after the global market like their American competitors, companies like Naver and LG have tried to target their A.I. models to specific industries, cultures or languages instead of pulling from the entire internet.

“The localized strategy is a reasonable strategy for them,” said Sukwoong Choi, a professor of information systems at the University at Albany. “U.S. firms are focused on general-purpose tools. South Korean A.I. firms can target a specific area.”

Outside the United States, A.I. prowess appears limited in reach. In China, Baidu’s answer to ChatGPT, called Ernie, and Huawei’s large language model have shown some success at home, but they are far from dominating the global market. Governments and companies in other nations like Canada, Britain, India and Israel have also said they are developing their own A.I. systems, though none has yet to release a system that can be used by the public.

About a year before ChatGPT was released, Naver, which operates South Korea’s most widely used search engine, announced that it had successfully created a large language model. But the chatbot based on that model, Clova X, was released only this September, nearly a year after ChatGPT’s debut.

Nako Sung, the executive who leads Naver’s generative A.I. project.Credit...Tina Hsu for The New York Times

Nako Sung, an executive at Naver who has led the company’s generative A.I. project, said the timing of ChatGPT’s release surprised him.

“Up until that point, we were taking a conservative approach to A.I. services and just cautiously exploring the possibilities,” Mr. Sung said. “Then we realized that the timeline had been accelerated a lot,” he added. “We decided we had to move immediately.”

Now, Naver runs an A.I. model built for Korean language speakers from the ground up using public data from the South Korean government and from its search engine, which has scraped the country’s internet since 1999.

Inside Naver’s headquarters in Seongnam, South Korea.Credit...Tina Hsu for The New York Times

Mr. Sung typing a prompt into Naver’s A.I. service.Credit...Tina Hsu for The New York Times

Naver released its A.I. chatbot, Clova X, nearly a year after ChatGPT’s debut.Credit...Tina Hsu for The New York Times

Clova X recognizes Korean idioms and the latest slang — language that American-made chatbots like Bard, ChatGPT and Claude often struggle to understand. Naver’s chatbot is also integrated into the search engine, letting people use the tool to shop and travel.

Outside its home market, the company is exploring business opportunities with the Saudi Arabian government. Japan could be another potential customer, experts said, since Line, a messaging service owned by Naver, is used widely there.

LG has also created its own generative A.I. model, the type of artificial intelligence capable of creating original content based on inputs, called Exaone. Since its creation in 2021, LG has worked with publishers, research centers, pharmaceutical firms and medical companies to tailor its system to their data sets and provide them access to its A.I. system.

The company is targeting businesses and researchers instead of the general user, said Kyunghoon Bae, the director of LG A.I. Research. Its subsidiaries have also begun using its own A.I. chatbots. One of the chatbots, built to analyze chemistry research and chemical equations, has been used by researchers building new materials for batteries, chemicals and medicine.

Honglak Lee, chief scientist of LG A.I. Research, and Kyunghoon Bae, who leads the branch, in their office in Seoul.Credit...Tina Hsu for The New York Times

Se-hui Han and Rodrigo Hormazabal, developers at LG A.I. Research, demonstrating LG’s Exaone Discovery in Seoul.Credit...Tina Hsu for The New York Times

“Rather than letting the best one or two A.I. systems dominate, it’s important to have an array of models specific to a domain, language or culture,” said Honglak Lee, the chief scientist of LG’s A.I. research arm.

Another South Korean behemoth, Samsung, last month announced Samsung Gauss, a generative A.I. model being used internally to compose emails, summarize documents and translate text. The company plans to integrate it into its mobile phones and smart home appliances.

Other major companies have also said they are developing their own large language models, making South Korea one of the few countries with so many companies building A.I. systems. KT, a South Korean telecommunications firm, has said it is working with a Thai counterpart, Jasmine Group, on a large language model specialized in the Thai language. Kakao, which makes an eponymous super app for chats, has said it is developing generative A.I. for Korean, English, Japanese, Vietnamese and Malaysian.

Still, the United States’ dominance in A.I. appears secure for now. It remains to be seen how closely countries can catch up.

“The market is convulsing; it’s very difficult to predict what’s going to happen,” said Mr. Lim, the A.I. policy expert. “It’s the Wild West, in a sense.”

Workers waiting for their company-provided transport after work at a Samsung Electronics campus south of Seoul. Samsung plans to use incorporate its chatbot into cellphones and smart home appliances.Credit...Tina Hsu for The New York Times

bnew · Dec 20, 2023

Prompt engineering techniques

Prompt engineering techniques with Azure OpenAI - Azure OpenAI Service

Learn about the options for how to use prompt engineering with GPT-3, GPT-35-Turbo, and GPT-4 models

learn.microsoft.com

https://archive.is/usqoM

bnew · Dec 20, 2023

https://archive.is/ERg2C

Large Language Models News & Discussions

Veteran

Veteran

Nvidia Staffers Warned CEO of Threat AI Would Pose to Minorities​

Veteran

Nvidia Ranks Close to the Bottom in Diverse Hiring​

Veteran

OpenAI Says Board Can Overrule CEO on Safety of New AI Releases​

Veteran

Veteran

Veteran

AI cannot be patent 'inventor', UK Supreme Court rules in landmark case​

'LEGITIMATE QUESTIONS'​

Veteran

Large AI Dataset Has Over 1,000 Child Abuse Images, Researchers Find​

Veteran

Largest Dataset Powering AI Images Removed After Discovery of Child Sexual Abuse Material​

Veteran

Veteran

Veteran

AI Robot Outmaneuvers Humans in Maze Run Breakthrough​

Veteran

Veteran

Prompt engineering techniques​

Veteran

Nvidia Staffers Warned CEO of Threat AI Would Pose to Minorities

Nvidia Ranks Close to the Bottom in Diverse Hiring

OpenAI Says Board Can Overrule CEO on Safety of New AI Releases

AI cannot be patent 'inventor', UK Supreme Court rules in landmark case

'LEGITIMATE QUESTIONS'

Large AI Dataset Has Over 1,000 Child Abuse Images, Researchers Find

Largest Dataset Powering AI Images Removed After Discovery of Child Sexual Abuse Material

AI Robot Outmaneuvers Humans in Maze Run Breakthrough

Prompt engineering techniques