bnew

Veteran
Joined
Nov 1, 2015
Messages
58,201
Reputation
8,613
Daps
161,846

Google is embedding inaudible watermarks right into its AI-generated music

SynthID will be used to watermark audio from DeepMind’s Lyria model, so it’s possible to work out if Google’s AI tech has been used in the creation of a track.

By Jon Porter, a reporter with five years of experience covering consumer tech releases, EU tech policy, online platforms, and mechanical keyboards.

Nov 16, 2023, 6:38 AM EST


Audio created using Google DeepMind’s AI Lyria model, such as tracks made with YouTube’s new audio generation features, will be watermarked with SynthID to let people identify their AI-generated origins after the fact. In a blog post, DeepMind said the watermark shouldn’t be detectable by the human ear and “doesn’t compromise the listening experience,” and added that it should still be detectable even if an audio track is compressed, sped up or down, or has extra noise added.

Watermarking tools like SynthID are seen as an important safeguard against some of the harms of generative AI. President Joe Biden’s executive order on artificial intelligence, for example, calls for a new set of government-led standards for watermarking AI-generated content. It’s a promising area, but current technologies are far from a silver bullet to defend against fakes.

According to DeepMind, SynthID’s audio implementation works by “converting the audio wave into a two-dimensional visualization that shows how the spectrum of frequencies in a sound evolves over time.” It claims the approach is “unlike anything that exists today.”
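The "two-dimensional visualization" DeepMind describes sounds like a spectrogram. Here is a minimal sketch of how one is computed, using only the Python standard library. This is not SynthID and does not embed any watermark; the sample rate, frame size, and test tones are illustrative assumptions, and the naive DFT is only practical for tiny signals.

```python
import cmath
import math

SAMPLE_RATE = 8000   # Hz, illustrative value
FRAME_SIZE = 64      # samples per analysis frame
HOP = 32             # samples between frame starts

def spectrogram(samples, frame_size=FRAME_SIZE, hop=HOP):
    """Return one list of magnitudes (per frequency bin) for each time frame."""
    # A Hann window softens the frame edges to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        mags = []
        # Naive DFT, keeping only the non-negative frequency bins.
        for k in range(frame_size // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            mags.append(abs(acc))
        frames.append(mags)
    return frames

def peak_bin(frame):
    """Index of the strongest frequency bin in one frame."""
    return max(range(len(frame)), key=frame.__getitem__)

# Test signal: a 500 Hz tone followed by a 2000 Hz tone.
signal = [math.sin(2 * math.pi * (500 if n < 256 else 2000) * n / SAMPLE_RATE)
          for n in range(512)]
frames = spectrogram(signal)

# Bin k corresponds to k * SAMPLE_RATE / FRAME_SIZE Hz.
first_peak_hz = peak_bin(frames[0]) * SAMPLE_RATE / FRAME_SIZE
last_peak_hz = peak_bin(frames[-1]) * SAMPLE_RATE / FRAME_SIZE
print(first_peak_hz, last_peak_hz)
```

Each row of `frames` is a slice of time and each column a frequency band, which is the time-by-frequency grid the quote refers to.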

The news that Google is embedding the watermarking feature into AI-generated audio comes just a few short months after the company released SynthID in beta for images created by Imagen on Google Cloud’s Vertex AI. The watermark is resistant to editing like cropping or resizing, although DeepMind cautioned that it’s not foolproof against “extreme image manipulations.”
 

bnew


A video that appears to show David Attenborough narrating a programmer's life shows Hollywood actors are right to be afraid of AI

Kai Xiang Teo

Nov 16, 2023, 4:09 AM EST

Documentary filmmaker David Attenborough, and actress Salma Hayek on that AI episode of "Black Mirror" — which seems like less and less of a fantasy now. Rob Pinney/Getty Images and Netflix
  • A programmer has created an AI version of David Attenborough to narrate his life.
  • It's eerily good at nailing not only Attenborough's voice, but also his narration style.
  • AI clones aren't just a fantasy from that "Black Mirror" episode with Salma Hayek — they're the real deal.

If you've ever wanted acclaimed broadcaster and documentary filmmaker Sir David Attenborough to narrate your life, you're not alone — and you don't have to keep merely wishing for it anymore. A programmer named Charlie Holtz has turned that wish into reality with AI.

In a demo video Holtz shared on X, the platform formerly known as Twitter, Attenborough's voice can be heard describing Holtz as if he were a character in a film.

"Here we have a remarkable specimen of Homo Sapiens, distinguished by his silver circular spectacles and a mane of tousled curly locks," says Attenborough. Of course, this isn't really Attenborough — it's AI-Attenborough.

David Attenborough is now narrating my life


Here's a GPT-4-vision + @elevenlabsio python script so you can star in your own Planet Earth: pic.twitter.com/desTwTM7RS
— Charlie Holtz (@charliebholtz) November 15, 2023




The narration appears to be unscripted, autonomous, and surprisingly realistic at capturing not only the documentarian's trademark voice, but also his distinctive style of speech.

"He's wearing what appears to be a blue fabric covering, which can only be assumed to be part of its mating display," AI-Attenborough adds.

This short demo shows that AI clones aren't just a fantasy from that "Black Mirror" episode with Salma Hayek and Annie Murphy — they're the real deal.

Holtz is a "hacker in residence" at Replicate, a machine-learning startup. He's been posting quirky experiments with AI on X — like one that uses AI to recommend how you should correct your posture.

This latest experiment, shared on Wednesday, has amassed over 1 million views. And it's made possible by combining OpenAI's GPT-4-vision, an AI model that can describe what it sees, and code from ElevenLabs, an AI voice startup.

I'm not sure what Attenborough thinks of his AI clone — he didn't respond to Business Insider's request for comment sent outside regular business hours — but reading the reaction from X users makes it clear why Hollywood actors are afraid of AI.

One X user wrote, "I'm going to get David Attenborough to narrate videos of my baby learning how to eat broccoli."

The Screen Actors Guild board approved a deal with studios to conclude the actors' strike, but AI remains an ongoing topic of concern in the industry.

On Saturday, Justine Bateman, the AI advisor to the union's negotiating committee, criticized the agreement for not doing enough to protect actors against the creation of their "digital doubles" and replacement by "synthetic performers."

"You will now compete with every actor, dead or alive, who has made their 'digital double' available for rent," Bateman wrote on X.

It reads like a prescient warning, especially now that Holtz has made the code behind his AI available online for others to use as they please.

Holtz did not immediately respond to BI's request for comment, sent outside regular business hours.



David Attenborough narrates your life




 

bnew


Data labeling companies are raising prices in the AI boom

AI chatbots require a lot of high-quality data, and now it's costing more

By Michelle Cheng
Published November 9, 2023

An undated handout image from U.S. startup Replika shows a user interacting with a smartphone app to customize an avatar for a personal artificial intelligence chatbot, known as a Replika, in San Francisco, California, U.S.

Avatars are still clunky-looking. Photo: Luka, Inc./Handout via Reuters (Reuters)

Need another indicator that the generative artificial intelligence industry is real and making progress? Look at the booming business of data labeling and annotation, which is an essential step in training the models that power AI products ranging from what’s currently in vogue in the industry (chatbots!) to ongoing projects such as self-driving vehicles and tools that diagnose diseases.

During the data labeling step, a team of humans will usually identify data points, whether that’s the severity of the damage in 100,000 photos of different cars for an insurance company, or the sentiments of people who interact with support agents for a customer service company. Data annotation is a critical step in the training of large language models (LLMs) like OpenAI’s GPT because it makes AI models more accurate.

Following OpenAI’s release of ChatGPT last November, data annotation companies have received so much demand that it is pushing some of them to raise prices.

Realeyes is a company based in London that uses computer vision to read and understand human behavior; that data is then used to improve advertising effectiveness or to minimize identity fraud. Since the company was collecting and labeling data for its own computer vision algorithms, the company decided two years ago to move into an analogous service of data labeling for other companies, said Mihkel Jäätma, the CEO of Realeyes, which works with over 200 companies across media, technology, and advertising.

The data labeling service began generating revenue last year, with the business getting “very big, very quickly,” he said. Jäätma estimates that 80% of the business comes from companies essentially looking to make avatars less cartoonish. “It’s really kind of exploded to be a very substantial part of our business only in the last two years and keeps going that way,” he said.

From the likes of big tech companies and well-funded AI startups, “[t]he investment that we see is that this is going to be overlaid with very human-like [features],” he said. In other words, the work now is to make these avatars—bots that exhibit personalities based on made-up characters or real people—understand users and talk in a more human way.

Since the launch of its data labeling service, Realeyes has raised prices at least twice. Jäätma said he has had to tell customers that if they weren’t willing to pay up, Realeyes would not complete the full request.


Making avatars more human-like

Labeling audio and visual recordings is complex. It’s not just data scraped from the Internet. Human annotators work on assessing people’s emotions, for example, and as that work gets more nuanced, it means paying the annotators more. (Realeyes was reportedly hired by Meta, which rolled out its own AI avatars in September, to make the tech giant’s avatars more human.)

Meanwhile, Snorkel AI, a company specializing in data labeling, said that the number of inquiries it received in the past three months was more than five times the total number received in the entire previous year, with requests coming from early-stage startups building large language models (LLMs), as well as government agencies and IT companies.

The Redwood City, California-based company has not raised prices, but it has rolled out additional service offerings around AI training since customers’ needs have diversified.


Data labeling is already a $2.2 billion industry

The growth in data labeling shows that generative AI applications are making progress. “With ChatGPT and other developments, the applications of AI are not out of reach,” said Devang Sachdev, vice president of marketing at Snorkel AI. The surge in AI products comes as LLMs from the likes of Google and OpenAI have also become much more accessible.

The global data collection and labeling market hit $2.2 billion in 2022 and is expected to grow at a compound annual rate of nearly 30% from 2023 to 2030, according to market research firm Grand View Research.
 

bnew





Introducing Alfred-40B-1023:

Pioneering the Future of Open-Source Language Models from LightOn
We are thrilled to unveil Alfred-40B-1023, the latest iteration of our celebrated open-source Language Model. Building on the solid foundation of its predecessor, Alfred-40B-1023 represents a significant leap forward, catering specifically to the needs of enterprises.

What's New in Alfred-40B-1023?

Alfred-40B-1023 boasts a range of enhancements and new features, including:
  • Reduced Hallucinations: One of the standout features of Alfred-40B-1023 is its refined ability to minimize hallucinations, ensuring more accurate and reliable outputs.
  • Enhanced Self-Awareness: In situations where the model lacks a definitive answer, Alfred-40B-1023 is now programmed to state, "I don't know", enhancing its transparency and trustworthiness.
  • Superior 'Chat with Docs' Capability: Alfred-40B-1023 is trained to perform 'Chat with Docs' tasks like no other, streamlining document interaction and information retrieval.
  • Expanded Context: With an increased context of 8K tokens, Alfred-40B-1023 can comprehend and generate longer and more intricate content, ensuring detailed and comprehensive responses.

Continuing the Legacy of Its Predecessor

Much like its predecessor, which was recognized for its prompt engineering, no-code application development, and classic LLM tasks, Alfred-40B-1023 is poised to set new benchmarks. The Falcon model continues to be the backbone, with Alfred-40B-1023 refining its capabilities to serve as an even more effective Generative AI copilot.

More than Just a Model – Introducing Paradigm:

Paradigm is not just a platform; it's a conviction. We firmly believe that the future of AI in enterprises and governments lies not just in models but in a robust platform equipped with tools to deploy this groundbreaking technology seamlessly.

Commitment to Open Source

In line with LightOn's dedication to promoting progress in the field, Alfred-40B-1023 is offered as an open-source model. While we persistently enhance the model, the open-source version might differ from the one on the Paradigm platform, ensuring that Paradigm users always have access to the most advanced version.

Training and Accessibility

Alfred-40B-1023 continues to benefit from the efficient and reliable infrastructure of Amazon SageMaker for its training. Soon, Alfred-40B-1023 will also be available on platforms like HuggingFace and AWS Jumpstart for Foundation Models, making its integration into diverse workflows even smoother.

Join Us in This Exciting Journey

We believe in the collective strength of the Generative AI community. With the release of Alfred-40B-1023, we showcase our deep expertise in training personalized models tailored to our clients' needs.

We invite the global community to join us. Dive into Alfred-40B-1023, contribute, and be a part of this transformative journey. We’re not just offering a model; we're sharing a vision of the future.
Be sure to catch our unveiling of Alfred-40B-1023 during the AI Pulse Keynote.

About LightOn:

A torchbearer in the Generative AI domain, LightOn is redefining the boundaries of AI capabilities. With a blend of pioneering models and robust platforms like Paradigm, we're guiding enterprises and governments into the next AI frontier.


 

bnew


AI robotics’ ‘GPT moment’ is near

Peter Chen (@peterxichen) / 9:35 AM EST • November 10, 2023


Image Credits: Robust.ai

Peter Chen, Contributor

Peter Chen is CEO and co-founder of Covariant, the world's leading AI robotics company. Before founding Covariant, Peter was a research scientist at OpenAI and a researcher at the Berkeley Artificial Intelligence Research (BAIR) Lab, where he focused on reinforcement learning, meta-learning, and unsupervised learning.

It’s no secret that foundation models have transformed AI in the digital world. Large language models (LLMs) like ChatGPT, LLaMA, and Bard revolutionized AI for language. While OpenAI’s GPT models aren’t the only large language models available, they have achieved the most mainstream recognition for taking text and image inputs and delivering human-like responses, even on some tasks requiring complex problem-solving and advanced reasoning.

ChatGPT’s viral and widespread adoption has largely shaped how society understands this new moment for artificial intelligence.

The next advancement that will define AI for generations is robotics. Building AI-powered robots that can learn how to interact with the physical world will enhance all forms of repetitive work in sectors ranging from logistics, transportation, and manufacturing to retail, agriculture, and even healthcare. It will also unlock as many efficiencies in the physical world as we’ve seen in the digital world over the past few decades.

While there is a unique set of problems to solve within robotics compared to language, there are similarities across the core foundational concepts. And some of the brightest minds in AI have made significant progress in building the “GPT for robotics.”


What enables the success of GPT?

To understand how to build the “GPT for robotics,” first look at the core pillars that have enabled the success of LLMs such as GPT.

Foundation model approach

GPT is an AI model trained on a vast, diverse dataset. Engineers previously collected data and trained specific AI for a specific problem. Then they would need to collect new data to solve another. Another problem? New data yet again. Now, with a foundation model approach, the exact opposite is happening.

Instead of building niche AIs for every use case, one can be universally used. And that one very general model is more successful than every specialized model. The AI in a foundation model performs better even on one specific task, because it can leverage learnings from other tasks, and it generalizes to new tasks better because it has learned additional skills from having to perform well across a diverse set of tasks.

Training on a large, proprietary, and high-quality dataset

To have a generalized AI, you first need access to a vast amount of diverse data. OpenAI obtained the real-world data needed to train the GPT models reasonably efficiently. GPT was trained on data collected from across the internet: a large and diverse dataset that includes books, news articles, social media posts, code, and more.

It’s not just the size of the dataset that matters; curating high-quality, high-value data also plays a huge role. The GPT models have achieved unprecedented performance because their high-quality datasets are informed predominantly by the tasks users care about and the most helpful answers.

Role of reinforcement learning (RL)

OpenAI employs reinforcement learning from human feedback (RLHF) to align the model’s response with human preference (e.g., what’s considered beneficial to a user). There needs to be more than pure supervised learning (SL) because SL can only approach a problem with a clear pattern or set of examples. LLMs require the AI to achieve a goal without a unique, correct answer. Enter RLHF.

RLHF allows the algorithm to move toward a goal through trial and error while a human acknowledges correct answers (high reward) or rejects incorrect ones (low reward). The AI finds the reward function that best explains the human preference and then uses RL to learn how to get there. ChatGPT can deliver responses that mirror or exceed human-level capabilities by learning from human feedback.
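To make "finds the reward function that best explains the human preference" a bit more concrete, here is a minimal sketch of fitting a reward model from pairwise preferences with a Bradley-Terry style objective. This is an illustration under loud assumptions, not OpenAI's implementation: the linear reward, the two hand-made features (helpfulness, rudeness), and the preference pairs are all hypothetical.

```python
import math

def train_reward_model(preferences, n_features, lr=0.1, epochs=200):
    """Fit linear weights so preferred responses get higher reward.

    preferences: list of (features_of_preferred, features_of_rejected).
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for good, bad in preferences:
            diff = [g - b for g, b in zip(good, bad)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            # Bradley-Terry: P(good beats bad) = sigmoid(r(good) - r(bad)).
            p = 1.0 / (1.0 + math.exp(-margin))
            # Gradient ascent on the log-likelihood of the human choice.
            for i in range(n_features):
                w[i] += lr * (1.0 - p) * diff[i]
    return w

def reward(w, features):
    """Scalar reward of one response under the learned weights."""
    return sum(wi * fi for wi, fi in zip(w, features))

# Hypothetical features per response: [helpfulness, rudeness].
# Each pair is (response the human preferred, response the human rejected).
preferences = [
    ([0.9, 0.1], [0.4, 0.8]),
    ([0.7, 0.0], [0.8, 0.9]),
    ([0.6, 0.2], [0.2, 0.3]),
]
weights = train_reward_model(preferences, n_features=2)
print(weights)
```

In a full RLHF pipeline the learned reward would then be the signal a policy is optimized against with RL; that second stage is not shown here.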


The next frontier of foundation models is in robotics

The same core technology that allows GPT to see, think, and even speak also enables machines to see, think, and act. Robots powered by a foundation model can understand their physical surroundings, make informed decisions, and adapt their actions to changing circumstances.

The “GPT for robotics” is being built the same way as GPT was — laying the groundwork for a revolution that will, yet again, redefine AI as we know it.

Foundation model approach

By taking a foundation model approach, you can also build one AI that works across multiple tasks in the physical world. A few years ago, experts advised making a specialized AI for robots that pick and pack grocery items. And that’s different from a model that can sort various electrical parts, which is different from the model unloading pallets from a truck.

This paradigm shift to a foundation model enables the AI to better respond to edge-case scenarios that frequently exist in unstructured real-world environments and might otherwise stump models with narrower training. Building one generalized AI for all of these scenarios is more successful. It’s by training on everything that you get the human-level autonomy we’ve been missing from the previous generations of robots.

Training on a large, proprietary, and high-quality dataset

Teaching a robot to learn what actions lead to success and what leads to failure is extremely difficult. It requires extensive high-quality data based on real-world physical interactions. Single lab settings or video examples are not reliable or robust enough sources (e.g., YouTube videos fail to translate the details of the physical interaction and academic datasets tend to be limited in scope).

Unlike AI for language or image processing, no preexisting dataset represents how robots should interact with the physical world. Thus, the large, high-quality dataset becomes a more complex challenge to solve in robotics, and deploying a fleet of robots in production is the only way to build a diverse dataset.

Role of reinforcement learning

Similar to answering text questions with human-level capability, robotic control and manipulation require an agent to seek progress toward a goal that has no single, unique, correct answer (e.g., “What’s a successful way to pick up this red onion?”). Once again, more than pure supervised learning is required.

You need a robot running deep reinforcement learning (deep RL) to succeed in robotics. This autonomous, self-learning approach combines RL with deep neural networks to unlock higher levels of performance — the AI will automatically adapt its learning strategies and continue to fine-tune its skills as it experiences new scenarios.
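As a toy illustration of learning by trial and error when there is no single labeled answer, here is an epsilon-greedy sketch that learns which grasp tends to work. Note the swap: this is a plain tabular bandit, not deep RL, and the grasp names and success probabilities are made up for the example.

```python
import random

# Hypothetical grasp strategies with made-up success probabilities.
TRUE_SUCCESS_RATE = {"pinch_side": 0.2, "scoop_below": 0.5, "suction_top": 0.9}

def learn_grasp(trials=5000, epsilon=0.1, seed=0):
    """Estimate each strategy's success rate from simulated attempts."""
    rng = random.Random(seed)
    names = list(TRUE_SUCCESS_RATE)
    attempts = {name: 0 for name in names}
    value = {name: 0.0 for name in names}
    for _ in range(trials):
        if rng.random() < epsilon:
            choice = rng.choice(names)            # explore a random strategy
        else:
            choice = max(names, key=value.get)    # exploit the best estimate
        # Simulated outcome: 1.0 if the pick succeeded, else 0.0.
        outcome = 1.0 if rng.random() < TRUE_SUCCESS_RATE[choice] else 0.0
        attempts[choice] += 1
        # Incremental running mean of observed outcomes.
        value[choice] += (outcome - value[choice]) / attempts[choice]
    return value

estimates = learn_grasp()
best = max(estimates, key=estimates.get)
print(best, estimates)
```

A deep RL system would replace the lookup table with a neural network that maps camera input to actions, but the loop of act, observe reward, update is the same idea.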


Challenging, explosive growth is coming

In the past few years, some of the world’s brightest AI and robotics experts laid the technical and commercial groundwork for a robotic foundation model revolution that will redefine the future of artificial intelligence.

While these AI models have been built similarly to GPT, achieving human-level autonomy in the physical world is a different scientific challenge for two reasons:

  1. Building an AI-based product that can serve a variety of real-world settings has a remarkable set of complex physical requirements. The AI must adapt to different hardware applications, as it’s doubtful that one type of hardware will work across various industries (logistics, transportation, manufacturing, retail, agriculture, healthcare, etc.) and activities within each sector.
  2. Warehouses and distribution centers are an ideal learning environment for AI models in the physical world. It’s common to have hundreds of thousands or even millions of different stock-keeping units (SKUs) flowing through any facility at any given moment — delivering the large, proprietary, and high-quality dataset needed to train the “GPT for robotics.”


AI robotics “GPT moment” is near

The growth of robotic foundation models is accelerating rapidly. Robotic applications, particularly within tasks that require precise object manipulation, are already being applied in real-world production environments, and we’ll see an exponentially growing number of commercially viable robotic applications deployed at scale in 2024.

Chen has published more than 30 academic papers that have appeared in the top global AI and machine learning journals.
 

bnew








1/n Breaking News! OpenAI has uncovered an emergent new cognitive capability, yet nobody is demanding answers! We are distracted by OpenAI governance politics and not the real issue!!!
Nov 19, 2023 · 11:31 AM UTC

Carlos E. Perez
@IntuitMachine
8h
2/n What is the breakthrough? I suspect it has to do with Retrieval-Augmented Generation (RAG). RAG is an architecture that allows an LLM to use a search engine to augment its reasoning. The problem has always been that the embeddings used by the search engine may not be the beneficial ones that augment the LLM's reasoning.
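For anyone unfamiliar with the retrieval step being described, here is a minimal sketch of embedding-based lookup feeding a prompt. It is only an illustration, not how OpenAI or Qdrant do it: the bag-of-words "embedding" stands in for a learned model, and the documents and question are invented.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: word counts. A real system would use a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=1):
    """Return the k documents whose embeddings are closest to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=1):
    """Prepend the retrieved passages to the question sent to the LLM."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

documents = [
    "The warranty covers battery replacement for two years.",
    "Shipping takes five business days within Europe.",
    "Returns are accepted within thirty days of purchase.",
]
question = "How long does shipping take?"
print(build_prompt(question, documents))
```

The weakness the tweet points at shows up even here: "take" and "takes" do not match under this embedding, so retrieval quality depends entirely on how well the embedding captures what the model actually needs.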
Carlos E. Perez
@IntuitMachine
8h
3/n I reported earlier that GPTs were using Qdrant as their vector engine. This is a lightweight and fast Rust-based implementation. On Dev Day, when GPTs were first made available, I tested RAG out using the books that I wrote. It performed terribly, unable to even conjure up a correct table of contents! nitter.rawbit.ninja/IntuitMachine/st…
Carlos E. Perez
@IntuitMachine
8h
4/n This changed remarkably when OpenAI rolled out its new model on November 11th. It was left unexplained how their RAG was so much improved. It's accurate in its responses across the books I've written. I know what the responses should be for these large documents because I wrote these books myself. nitter.rawbit.ninja/sama/status/1723…
Carlos E. Perez
@IntuitMachine
8h
5/n It's possible that what's available to the public isn't the latest and greatest. But this is an indicator that OpenAI has done considerable work on RAG architectures.
Carlos E. Perez
@IntuitMachine
8h
6/n What kind of work? What it appears to be is that OpenAI's model is able to compute in-context the optimal embedding that will retrieve the best information to augment its original request. It's like a system that knows the best queries to answer a question. But not just that, a system that knows the best kind of search engine response that leads to the best answer. So it's beyond question making!
Carlos E. Perez
@IntuitMachine
8h
7/n This is actually a much bigger deal because GPT can now retrieve information that is *not* in its knowledge on the fly! It implies a first step towards an LLM that is not encumbered by its original training set! It's a first step towards a self-authoring mind.

Carlos E. Perez
@IntuitMachine
8h
8/n If you are still wondering if AGI is approaching, you are totally wrong! It's already here! nitter.rawbit.ninja/IntuitMachine/st…
Carlos E. Perez
@IntuitMachine
Oct 25
Confirmation that AGI is indeed here!

The classic argument made over 30 years ago by Fodor and Pylyshyn - that neural networks fundamentally lack the systematic compositional skills of humans due to their statistical nature - has cast a long shadow over neural network research. Their critique framed doubts about the viability of connectionist models in cognitive science. This new research finally puts those doubts to rest.

Through an innovative meta-learning approach called MLC, the authors demonstrate that a standard neural network model can exhibit impressive systematic abilities given the right kind of training regimen. MLC optimizes networks for compositional skills by generating a diverse curriculum of small but challenging compositional reasoning tasks. This training nurtures in the network a talent for rapid systematic generalization that closely matches human experimental data.

The model not only displays human-like skills of interpreting novel systematic combinations, but also captures subtle patterns of bias-driven errors that depart from purely algebraic reasoning. This showcases the advantages of neural networks in flexibly blending structure and statistics to model the nuances of human cognition.

Furthermore, this research provides a framework for reverse engineering and imparting other human cognitive abilities in neural networks. The training paradigm bridges neuroscience theories of inductive biases with advanced machine learning techniques. The approach could potentially elucidate the origins of compositional thought in childhood development.

By resolving this classic debate on the capabilities of neural networks, and elucidating connections between human and artificial intelligence, this research marks an important milestone. The results will open new frontiers at the intersection of cognitive science and machine learning. Both fields stand to benefit enormously from this integration.

In summary, by settling such a historically significant critique and enabling new cross-disciplinary discoveries, this paper makes an immensely valuable contribution with profound implications for our understanding of intelligence, natural and artificial. Its impact will be felt across these disciplines for years to come.

Carlos E. Perez
@IntuitMachine
8h
9/n If you are saying there aren't definitions of AGI, then try the paper linked in this thread for size: nitter.rawbit.ninja/IntuitMachine/st…
Carlos E. Perez
@IntuitMachine
Nov 7
9 definitions of Artificial General Intelligence (AGI) and why they are flawed.

1. The Turing Test
- Flaw: Focuses on fooling humans rather than intelligence, easy to game by producing human-like text without intelligence.

2. Strong AI - Consciousness
- Limitation: No agreement on measuring machine consciousness. Focus on vague concepts rather than capabilities.

3. Human Brain Analogy
- Limitation: While loosely inspired by the brain, successful AI need not strictly mimic biology. Overly constrains mechanisms.

4. Human Cognitive Task Performance
- Limitation: What tasks? Which people? Lacks specificity and measurement.

5. Ability to Learn Tasks
- Strength: Identifies learning as important AGI ability.
- Limitation: Still lacks concrete measurement.

6. Economically Valuable Work
- Limitation: Misses non-economic values of intelligence like creativity. Requires deployment.

7. Flexible & General - Coffee Test
- Strength: Concrete example tasks.
- Limitation: Proposed tasks may not fully define AGI.

8. Artificial Capable Intelligence
- Strength: Emphasizes complex, multi-step real-world tasks.
- Limitation: Focuses narrowly on profitability.

9. LLMs as Generalists
- Limitation: Lacks performance criteria - generality alone insufficient.

Carlos E. Perez
@IntuitMachine
8h
10/n AGI is already here and it's just incrementally getting more capable! It's now all about crafting good curriculums!
Carlos E. Perez
@IntuitMachine
3h
11/n I'm not the only one declaring AGI is here. Peter Norvig who wrote the classic book on AI also says the same:
Carlos E. Perez
@IntuitMachine
Oct 20
AGI is Here

The threshold for artificial general intelligence has undeniably been crossed. Though janky and flawed, today's models exhibit versatile, human-like competence at information tasks when prompted in natural language. This long-sought achievement springs from big data and computing power, not rules or symbols. The future promises rapid improvements to these proto-AGIs.

Yet healthy skepticism remains. Better tests can reveal limitations, spurring progress. And necessary societal debates await. How will AGI's benefits be shared? Its risks mitigated? But denialism cannot postpone pressing policy questions. By recognising models' unprecedented breadth, not just depth, we take the first step toward oversight: acknowledging the evident breakthrough, humanity's historic ingenuity and the challenges ahead.

noemamag.com/artificial-gene…
Carlos E. Perez
@IntuitMachine
27m
12/n Join the community to keep track of the latest AGI developments. Alternatively, please send me a $2.99 tip by subscribing. nitter.rawbit.ninja/i/communities/17…
Carlos E. Perez
@IntuitMachine
10m
13/n The research paper on Self-RAG gives a glimpse of how LLMs can be fine-tuned to know how to query for information that augments their response:
Carlos E. Perez
@IntuitMachine
Oct 22
Tired of AI outputs that are factually incorrect or lack attribution? SELF-RAG offers a solution - an AI system capable of reflective reasoning and self-critique.

SELF-RAG decides when to retrieve external information based on relevance. It provides fine-grained assessments on whether its responses are supported by evidence. This enables transparency and verifiability.

Key benefits:

- Significantly higher accuracy on question answering and reasoning tasks compared to state-of-the-art models

- Improved factuality metrics on long-form text generation

- 70% higher citation accuracy versus retrieval-augmented methods

- Allows customization between accuracy, fluency, and completeness

By emulating human-like retrieval, reasoning, and self-reflection, SELF-RAG represents an important advance towards more trustworthy and controllable language generation.
 

bnew


KGAG: Knowledge Graph Augmented Generation in Language Models

TLDR;
KGAG will be tomorrow what RAG is today for grounding LLM inferences. But there is more: training LLMs to produce well-reasoned, knowledge-driven inferences will be made possible by #KGAG = Knowledge Graph Augmented Generation

As someone deeply invested in the evolution of language models, I'm excited to share my vision for the next significant leap in this technology: Knowledge Graph Augmented Generation (KGAG). This approach promises to transcend the current capabilities of language models, offering a more nuanced, reasoned, and semantically rich interaction. Unlike Retrieval Augmented Generation (RAG), which primarily rephrases existing information, KGAG aims to deeply understand the semantics of knowledge and utilize it for a more insightful generation of content.

The power of Knowledge Graphs in LLMs

At the heart of KGAG lies the concept of knowledge graphs. A knowledge graph is a structured representation of facts, where entities and their interrelations are mapped in a way that mirrors human understanding. By integrating knowledge graphs with language models, we can achieve a more accurate and context-aware generation of content.

How Knowledge Graphs Enhance Language Models:

Contextual Understanding: Knowledge graphs provide a contextual framework, allowing language models to understand the relationships between concepts, rather than treating information as isolated data points.

Semantic Richness: They infuse semantic depth into the language models, enabling them to grasp the meaning behind words and phrases, beyond just syntax.

Reasoned Responses: By understanding relationships and hierarchies, language models can generate responses that are not just factually accurate but logically sound.
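To make "entities and their interrelations" concrete, here is a minimal sketch of a knowledge graph stored as (subject, relation, object) triples with a wildcard lookup. The class name `TinyKnowledgeGraph`, its methods, and the example facts are all assumptions made for illustration; a real deployment would sit on a graph database.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]


class TinyKnowledgeGraph:
    """A toy in-memory triple store (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.triples: List[Triple] = []

    def add(self, subject: str, relation: str, obj: str) -> None:
        self.triples.append((subject, relation, obj))

    def match(
        self,
        subject: Optional[str] = None,
        relation: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> List[Triple]:
        """Return triples matching the pattern; None acts as a wildcard."""
        return [
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (relation is None or t[1] == relation)
            and (obj is None or t[2] == obj)
        ]


kg = TinyKnowledgeGraph()
kg.add("aspirin", "is_a", "NSAID")
kg.add("ibuprofen", "is_a", "NSAID")
kg.add("aspirin", "treats", "headache")

# Which entities are NSAIDs?
print(kg.match(relation="is_a", obj="NSAID"))
```

Because relations are explicit, a question like "which drugs are in the same class as aspirin?" becomes a pattern lookup rather than a text search.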

Building Knowledge Graphs: A Step-by-Step Guide

To harness the potential of KGAG, one must first build a robust knowledge graph. Here’s a simplified action plan:

Define the Domain and Scope: Clearly identify the domain for which the knowledge graph is to be created, and determine its scope.

Data Collection and Preparation: Gather relevant data sources and prepare the data by cleaning and normalizing it.

Schema Design: Create a schema that accurately represents the entities and relationships within your domain.

Entity Recognition and Linking: Utilize tools and resources like @spacy_io, Stanford NER (@stanfordnlp) and @nltk_org for entity recognition and link these entities within the knowledge graph.

Graph Construction: Choose an appropriate graph database like @JanusGraph, @NebulaGraph, @Neo4j, @memgraphdb to handle your graph data.

Refinement and Enrichment: Continually update and enrich the graph with new data and quality control measures.

Integration with Language Models: This is where the magic happens. Integrate your knowledge graph with language models to enable KGAG. This integration requires custom development, where the language model queries the knowledge graph to enrich its responses.
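The last step, where the language model queries the knowledge graph to enrich its response, could look roughly like this in its simplest form: find facts whose entities are mentioned in the question and put them in the prompt. The function names, prompt wording, and naive substring matching are assumptions for the sake of a short sketch; real entity linking is considerably more involved, and the actual LLM call is left out.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def facts_for_question(question: str, triples: List[Triple]) -> List[Triple]:
    """Naive entity linking: keep triples whose subject or object appears in the question."""
    q = question.lower()
    return [t for t in triples if t[0].lower() in q or t[2].lower() in q]


def build_prompt(question: str, triples: List[Triple]) -> str:
    """Render the matching facts as plain sentences ahead of the question."""
    facts = facts_for_question(question, triples)
    lines = [f"- {s} {r.replace('_', ' ')} {o}" for s, r, o in facts]
    context = "\n".join(lines) if lines else "- (no matching facts)"
    return (
        f"Known facts:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the facts above."
    )


TRIPLES = [
    ("Neo4j", "is_a", "graph database"),
    ("JanusGraph", "is_a", "graph database"),
    ("spaCy", "supports", "entity recognition"),
]

print(build_prompt("What is Neo4j?", TRIPLES))
```

The resulting string would then be sent to whatever model is in use; only facts about entities named in the question reach the prompt.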

Leveraging Current LLM Capabilities with Knowledge Graphs

Given the current capabilities of large language models (LLMs), there are practical ways to start leveraging knowledge graphs:

Augment Data Feeds: Use knowledge graphs to augment the data fed into LLMs during training, ensuring richer context and semantic understanding.

Post-Processing Responses: Utilize knowledge graphs as a post-processing step for LLM outputs. Run model responses through a filter that references the knowledge graph for accuracy and depth.

Hybrid Query Systems: Develop systems where LLMs and knowledge graphs work in tandem - the LLM generates content, and the knowledge graph provides contextual checks and balances.

Continuous Learning Loop: Establish a feedback loop where LLMs learn from the evolving knowledge graph, constantly improving their understanding and output.

The Path Forward: Realizing the Potential of KGAG

As we venture into the integration of knowledge graphs with language models, the focus should be on pragmatic and actionable steps towards realizing this technology's potential. The journey toward Knowledge Graph Augmented Generation (KGAG) isn't just about high-level concepts; it's about tangible improvements in how we interact with and benefit from AI in everyday applications.

A Grounded Approach to KGAG:

Start with Specific Domains: Begin by implementing KGAG in specific domains where the impact can be directly measured, such as healthcare, legal, or financial services. This targeted approach allows for more controlled development and clearer assessment of benefits.

Collaboration Between Experts: Involve domain experts in the development process. Their insights are crucial in ensuring that the knowledge graph accurately reflects the nuances of the domain.

Focus on Incremental Improvements: Look for opportunities where KGAG can make incremental but significant improvements in existing systems. For instance, enhancing customer service chatbots in banks or support systems in hospitals.

Measure Impact Rigorously: Implement metrics to evaluate the effectiveness of KGAG. This could be in terms of accuracy, response time, user satisfaction, or other relevant KPIs.

Encourage Community Involvement: Foster a community around KGAG, inviting contributions, feedback, and ideas. This could involve open-source projects, hackathons, or academic partnerships.

Prepare for Ethical and Practical Challenges: Be proactive in addressing potential ethical implications, data privacy concerns, and biases in AI models. Establish guidelines and best practices for responsible use of KGAG.

Educate and Train the Workforce: As KGAG evolves, it’s vital to educate and train professionals to work with this new technology. Workshops, courses, and certifications can play a significant role here.

Stay Adaptive and Open to Feedback: As KGAG systems are deployed, continually gather user feedback and adapt the system. The goal is to ensure that these systems remain relevant and effective in real-world scenarios.

In essence, the path forward for KGAG is about grounding lofty ideas in practical applications, focusing on domains where it can make a real difference, and taking a measured, collaborative approach to development and deployment. It’s about building a technology that’s not only advanced but also useful, ethical, and accessible. This grounded approach will enable us to harness the full potential of KGAG in a way that benefits us all.





[Submitted on 30 May 2023]

Knowledge Graph-Augmented Language Models for Knowledge-Grounded Dialogue Generation

Minki Kang, Jin Myung Kwak, Jinheon Baek, Sung Ju Hwang
Language models have achieved impressive performances on dialogue generation tasks. However, when generating responses for a conversation that requires factual knowledge, they are far from perfect, due to an absence of mechanisms to retrieve, encode, and reflect the knowledge in the generated responses. Some knowledge-grounded dialogue generation methods tackle this problem by leveraging facts from Knowledge Graphs (KGs); however, they do not guarantee that the model utilizes a relevant piece of knowledge from the KG. To overcome this limitation, we propose SUbgraph Retrieval-augmented GEneration (SURGE), a framework for generating context-relevant and knowledge-grounded dialogues with the KG. Specifically, our SURGE framework first retrieves the relevant subgraph from the KG, and then enforces consistency across facts by perturbing their word embeddings conditioned by the retrieved subgraph. Then, we utilize contrastive learning to ensure that the generated texts have high similarity to the retrieved subgraphs. We validate our SURGE framework on OpendialKG and KOMODIS datasets, showing that it generates high-quality dialogues that faithfully reflect the knowledge from KG.
Comments: Preprint. Under review
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2305.18846 [cs.CL]
(or arXiv:2305.18846v1 [cs.CL] for this version)

Submission history

From: Minki Kang
[v1] Tue, 30 May 2023 08:36:45 UTC (6,015 KB)
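The first stage of the framework in the abstract above, retrieving the subgraph relevant to the dialogue context, can be approximated with a very small sketch. Note the swap: the paper works with learned representations, while this example scores triples by bag-of-words cosine similarity, purely to show the shape of the retrieval step. The function names and the movie facts are illustrative assumptions.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

Triple = Tuple[str, str, str]


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; underscores in relation names become spaces."""
    return Counter(re.findall(r"\w+", text.lower().replace("_", " ")))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_subgraph(context: str, triples: List[Triple], k: int = 2) -> List[Triple]:
    """Return the k triples most similar to the dialogue context."""
    ctx = bag_of_words(context)
    return sorted(triples, key=lambda t: cosine(ctx, bag_of_words(" ".join(t))), reverse=True)[:k]


KG = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Inception", "released_in", "2010"),
    ("Titanic", "directed_by", "James Cameron"),
]

print(retrieve_subgraph("Who directed Inception?", KG))
```

The retrieved triples would then condition generation; the consistency and contrastive-learning parts of the framework are not covered by this sketch.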
 

bnew



from another tweet

The MLC process using examples:

Imagine a language model named Charles.

Let's say Charles is presented with some novel words and a few examples that demonstrate their meaning:

Study examples:
glerk → RED CIRCLE
blicket → BLUE CIRCLE

Now Charles is given a new word and asked to interpret it systematically based on the examples:

Query:
fep

Charles leverages its compositional skills nurtured by MLC to infer that fep likely maps to a new color, since the examples mapped individual words to colored circles. It responds:

GREEN CIRCLE

In another episode, Charles is given some examples showing words that combine other words:

Study examples:
glerk kiki blicket → BLUE CIRCLE RED CIRCLE
blicket kiki glerk → RED CIRCLE BLUE CIRCLE

Query:
fep kiki glip

Charles recognizes kiki is combining words. It systematically composes the likely meanings of fep and glip from the examples, responding:

PURPLE CIRCLE ORANGE CIRCLE

By training on many such episodes requiring rapid generalization, MLC enables Charles to learn how to learn - to systematically compose meanings from limited examples.

This illustrates how the curriculum of compositional reasoning tasks teaches the model to exhibit human-like systematicity in novel situations, as quantified by its strong performance matching people.
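The episodes above can be written down as a tiny hand-coded interpreter, which spells out the behaviour the network is expected to produce. This is not MLC itself: MLC trains a neural network to infer such rules from the study examples, whereas here the rule for `kiki` (emit the meaning of the second word, then the first) is hard-coded as an assumption read off the two study examples. The function names are made up for the sketch.

```python
from typing import Dict, List, Tuple


def learn_primitives(study_examples: List[Tuple[str, str]]) -> Dict[str, str]:
    """Collect word -> meaning pairs from single-word study examples."""
    return {word: meaning for word, meaning in study_examples if " " not in word}


def interpret(phrase: str, primitives: Dict[str, str]) -> List[str]:
    """Interpret 'x kiki y' as meaning(y) followed by meaning(x)."""
    words = phrase.split()
    if len(words) == 3 and words[1] == "kiki":
        return interpret(words[2], primitives) + interpret(words[0], primitives)
    return [primitives[phrase]]


study = [("glerk", "RED CIRCLE"), ("blicket", "BLUE CIRCLE")]
primitives = learn_primitives(study)

print(interpret("glerk kiki blicket", primitives))
print(interpret("blicket kiki glerk", primitives))
```

Both outputs reproduce the study examples for `kiki`; the hard part that MLC addresses is getting a network to arrive at this kind of rule from a handful of examples rather than having it written in by hand.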



snippet:

Human-like systematic generalization through a meta-learning neural network

Nature volume 623, pages 115–121 (2023)

Abstract

The power of human language and thought arises from systematic compositionality, the algebraic ability to understand and produce novel combinations from known components. Fodor and Pylyshyn [1] famously argued that artificial neural networks lack this capacity and are therefore not viable models of the mind. Neural networks have advanced considerably in the years since, yet the systematicity challenge persists. Here we successfully address Fodor and Pylyshyn’s challenge by providing evidence that neural networks can achieve human-like systematicity when optimized for their compositional skills. To do so, we introduce the meta-learning for compositionality (MLC) approach for guiding training through a dynamic stream of compositional tasks. To compare humans and machines, we conducted human behavioural experiments using an instruction learning paradigm. After considering seven different models, we found that, in contrast to perfectly systematic but rigid probabilistic symbolic models, and perfectly flexible but unsystematic neural networks, only MLC achieves both the systematicity and flexibility needed for human-like generalization. MLC also advances the compositional skills of machine learning systems in several systematic generalization benchmarks. Our results show how a standard neural network architecture, optimized for its compositional skills, can mimic human systematic generalization in a head-to-head comparison.

Main

People are adept at learning new concepts and systematically combining them with existing concepts. For example, once a child learns how to ‘skip’, they can understand how to ‘skip backwards’ or ‘skip around a cone twice’ due to their compositional skills. Fodor and Pylyshyn [1] argued that neural networks lack this type of systematicity and are therefore not plausible cognitive models, leading to a vigorous debate that spans 35 years [2,3,4,5]. Counterarguments to Fodor and Pylyshyn [1] have focused on two main points. The first is that human compositional skills, although important, may not be as systematic and rule-like as Fodor and Pylyshyn indicated [3,6,7]. The second is that neural networks, although limited in their most basic forms, can be more systematic when using sophisticated architectures [8,9,10]. In recent years, neural networks have advanced considerably and led to a number of breakthroughs, including in natural language processing. In light of these advances, we and other researchers have reformulated classic tests of systematicity and reevaluated Fodor and Pylyshyn’s arguments [1]. Notably, modern neural networks still struggle on tests of systematicity [11,12,13,14,15,16,17,18], tests that even a minimally algebraic mind should pass [2]. As the technology marches on [19,20], the systematicity debate continues.
In this Article, we provide evidence that neural networks can achieve human-like systematic generalization through MLC, an optimization procedure that we introduce for encouraging systematicity through a series of few-shot compositional tasks (Fig. 1). Our implementation of MLC uses only common neural networks without added symbolic machinery, and without hand-designed internal representations or inductive biases. Instead, MLC provides a means of specifying the desired behaviour through high-level guidance and/or direct human examples; a neural network is then asked to develop the right learning skills through meta-learning [21].
 

bnew


[Submitted on 3 Oct 2023]

Language Models Represent Space and Time

Wes Gurnee, Max Tegmark
The capabilities of large language models (LLMs) have sparked debate over whether such systems just learn an enormous collection of superficial statistics or a coherent model of the data generating process: a world model. We find evidence for the latter by analyzing the learned representations of three spatial datasets (world, US, NYC places) and three temporal datasets (historical figures, artworks, news headlines) in the Llama-2 family of models. We discover that LLMs learn linear representations of space and time across multiple scales. These representations are robust to prompting variations and unified across different entity types (e.g. cities and landmarks). In addition, we identify individual "space neurons" and "time neurons" that reliably encode spatial and temporal coordinates. Our analysis demonstrates that modern LLMs acquire structured knowledge about fundamental dimensions such as space and time, supporting the view that they learn not merely superficial statistics, but literal world models.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as: arXiv:2310.02207 [cs.LG]
(or arXiv:2310.02207v1 [cs.LG] for this version)

Submission history

From: Wes Gurnee
[v1] Tue, 3 Oct 2023 17:06:52 UTC (6,602 KB)
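A "linear representation" in the sense of the abstract above is usually tested with a linear probe: fit a linear map from hidden activations to the coordinates and check how well it predicts held-out points. The sketch below does this with NumPy on synthetic activations that stand in for Llama-2 hidden states, so it only shows the mechanics of probing and says nothing about any real model. The function names, dimensions, and noise level are all assumptions.

```python
import numpy as np


def fit_linear_probe(activations: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """Least-squares fit of targets from activations, with a bias column appended."""
    X = np.hstack([activations, np.ones((activations.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return weights


def probe_r2(weights: np.ndarray, activations: np.ndarray, targets: np.ndarray) -> float:
    """Coefficient of determination of the probe's predictions."""
    X = np.hstack([activations, np.ones((activations.shape[0], 1))])
    residual = targets - X @ weights
    total = targets - targets.mean(axis=0)
    return float(1.0 - (residual ** 2).sum() / (total ** 2).sum())


# Synthetic stand-in: 2-D "coordinates" linearly mixed into 32-D "activations" plus noise.
rng = np.random.default_rng(0)
coords = rng.uniform(-1.0, 1.0, size=(200, 2))
mixing = rng.normal(size=(2, 32))
activations = coords @ mixing + 0.1 * rng.normal(size=(200, 32))

weights = fit_linear_probe(activations[:150], coords[:150])
r2 = probe_r2(weights, activations[150:], coords[150:])
print(f"held-out R^2: {r2:.3f}")
```

A high held-out R^2 on real activations is the kind of evidence the paper reports; on this synthetic data it is high by construction.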



 

bnew


Orca 2

Orca 2 is a helpful assistant that is built for research purposes only and provides a single turn response in tasks such as reasoning over user given data, reading comprehension, math problem solving and text summarization. The model is designed to excel particularly in reasoning.

We open-source Orca 2 to encourage further research on the development, evaluation, and alignment of smaller LMs.

What is Orca 2’s intended use(s)?

  • Orca 2 is built for research purposes only.
  • The main purpose is to allow the research community to assess its abilities and to provide a foundation for building better frontier models.

How was Orca 2 evaluated?

  • Orca 2 has been evaluated on a large number of tasks ranging from reasoning to grounding and safety. Please refer to Section 6 and Appendix in the Orca 2 paper for details on evaluations.

Model Details

Orca 2 is a finetuned version of LLAMA-2. Orca 2’s training data is a synthetic dataset that was created to enhance the small model’s reasoning abilities. All synthetic training data was moderated using the Microsoft Azure content filters. More details about the model can be found in the Orca 2 paper.

Please refer to LLaMA-2 technical report for details on the model architecture.

License

Orca 2 is licensed under the Microsoft Research License.

Llama 2 is licensed under the LLAMA 2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.

Bias, Risks, and Limitations

Orca 2, built upon the LLaMA 2 model family, retains many of its limitations, as well as the common limitations of other large language models and limitations caused by its training process, including:

Data Biases: Large language models, trained on extensive data, can inadvertently carry biases present in the source data. Consequently, the models may generate outputs that could be potentially biased or unfair.

Lack of Contextual Understanding: Despite their impressive capabilities in language understanding and generation, these models exhibit limited real-world understanding, resulting in potential inaccuracies or nonsensical responses.

Lack of Transparency: Due to the complexity and size, large language models can act as “black boxes”, making it difficult to comprehend the rationale behind specific outputs or decisions. We recommend reviewing transparency notes from Azure for more information.

Content Harms: There are various types of content harms that large language models can cause. It is important to be aware of them when using these models, and to take actions to prevent them. It is recommended to leverage various content moderation services provided by different companies and institutions. On an important note, we hope for better regulations and standards from government and technology leaders around content harms for AI technologies in future. We value and acknowledge the important role that research and open source community can play in this direction.

Hallucination: It is important to be aware and cautious not to entirely rely on a given language model for critical decisions or information that might have deep impact as it is not obvious how to prevent these models from fabricating content. Moreover, it is not clear whether small models may be more susceptible to hallucination in ungrounded generation use cases due to their smaller sizes and hence reduced memorization capacities. This is an active research topic and we hope there will be more rigorous measurement, understanding and mitigations around this topic.

Potential for Misuse: Without suitable safeguards, there is a risk that these models could be maliciously used for generating disinformation or harmful content.

Data Distribution: Orca 2’s performance is likely to correlate strongly with the distribution of the tuning data. This correlation might limit its accuracy in areas underrepresented in the training dataset such as math, coding, and reasoning.

System messages: Orca 2 demonstrates variance in performance depending on the system instructions. Additionally, the stochasticity introduced by the model size may lead to generation of non-deterministic responses to different system instructions.

Zero-Shot Settings: Orca 2 was trained on data that mostly simulate zero-shot settings. While the model demonstrates very strong performance in zero-shot settings, it does not show the same gains from few-shot learning as other, especially larger, models.

Synthetic data: As Orca 2 is trained on synthetic data, it could inherit both the advantages and shortcomings of the models and methods used for data generation. We posit that Orca 2 benefits from the safety measures incorporated during training and safety guardrails (e.g., content filter) within the Azure OpenAI API. However, detailed studies are required for better quantification of such risks.

This model is solely designed for research settings, and its testing has only been carried out in such environments. It should not be used in downstream applications, as additional analysis is needed to assess potential harm or bias in the proposed application.



 

bnew



ML News of the day🚨

Is it hard to keep up with all the amazing things happening in the ML ecosystem? Here is a quick summary of yesterday's top 5 ML releases - from small models to video generation!🚀

1. Microsoft releases Orca 2
Orca 2's goal is to explore the capabilities of small LLMs. Orca 2 is a Llama 2 fine-tune trained on high-quality synthetic data that covers different reasoning techniques.

Why is this interesting? Recent research has shown impressive small-model capabilities, often comparable to models that are 5-10x larger.

Paper: arxiv.org/abs/2311.11045
Model: huggingface.co/microsoft/Orc…

2. SEINE - Video Diffusion Model
SEINE allows short-to-long video generation as well as image-to-video. Its focus is on high-quality long videos that keep consistency and have smooth transitions. For example, you can give an initial and a final image, and it will generate a smooth video out of it.

Paper: arxiv.org/abs/2310.20700
Repo: github.com/Vchitect/SEINE

3. System 2 Attention (S2A)
Soft attention can assign a probability to irrelevant parts of a context. S2A regenerates the context so irrelevant parts are removed. Using S2A contexts produces more factual and objective responses.

Paper: arxiv.org/abs/2311.11829

4. Nous-Yarn-Llama
This model extends Llama 2 70B by further training on long-context data using the YaRN extension method, increasing Llama 2's context window from 4k tokens to 32k.

Model: huggingface.co/NousResearch/…

5. Video LLaVA
Video LLaVA is a robust large vision-language baseline model with a mixed dataset of images and videos. This model can answer questions about input videos, images, or both at the same time (e.g. does the flag in the image appear in the video?)

Paper: arxiv.org/abs/2311.10122
Demo: huggingface.co/spaces/Langua…
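For item 3 above (System 2 Attention), the two-step idea can be sketched as: first ask the model to regenerate the context with irrelevant parts removed, then answer using only the regenerated context. The `llm` argument is a hypothetical text-in, text-out callable and the prompt wording is made up for this sketch; neither is taken from the paper.

```python
from typing import Callable, List

REWRITE_INSTRUCTION = (
    "Rewrite the text below, keeping only the parts that are relevant "
    "to answering the question. Drop opinions and unrelated details.\n\n"
)


def s2a_answer(context: str, question: str, llm: Callable[[str], str]) -> str:
    """Two calls: regenerate the context, then answer from the regenerated context only."""
    cleaned = llm(f"{REWRITE_INSTRUCTION}Text: {context}\nQuestion: {question}")
    return llm(f"Context: {cleaned}\nQuestion: {question}\nAnswer:")


# A fake model that records prompts, so the flow can be inspected without any API.
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return "Max has 3 apples." if len(calls) == 1 else "3"


answer = s2a_answer(
    "Max has 3 apples. I think the answer is 5, by the way.",
    "How many apples does Max have?",
    fake_llm,
)
print(answer)
```

The point of the second call is that the distracting opinion never reaches the prompt that produces the final answer.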

New resources

Understanding training loss patterns by @stasbekman nitter.unixfox.eu/StasBekman/statu…
Learn a lot about fine-tuning LLMs and LoRAs by @rasbt nitter.unixfox.eu/rasbt/status/172…
Does sketching work? Learn about this tool for reducing matrix dimensions, by @ethanepperly huggingface.co/blog/ethanepp…
 

bnew


NOVEMBER 20, 2023


Synthetic imagery sets new bar in AI training efficiency

by Rachel Gordon, Massachusetts Institute of Technology

An MIT team studies the potential of learning visual representations using synthetic images generated by text-to-image models. They are the first to show that models trained solely with synthetic images outperform the counterparts trained with real images, in large-scale settings. Credit: Alex Shipps/MIT CSAIL via the Midjourney AI image generator

Data is the new soil, and in this fertile new ground, MIT researchers are planting more than just pixels. By using synthetic images to train machine learning models, a team of scientists recently surpassed results obtained from traditional "real-image" training methods.

At the core of the approach is a system called StableRep, which doesn't just use any synthetic images; it generates them through ultra-popular text-to-image models like Stable Diffusion. It's like creating worlds with words.

So what's in StableRep's secret sauce? A strategy called "multi-positive contrastive learning."

"We're teaching the model to learn more about high-level concepts through context and variance, not just feeding it data," says Lijie Fan, MIT Ph.D. student in electrical engineering, affiliate of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), and lead researcher on the work, which is currently posted to the arXiv preprint server.

"When multiple images, all generated from the same text, all treated as depictions of the same underlying thing, the model dives deeper into the concepts behind the images, say the object, not just their pixels."

This approach treats multiple images spawned from identical text prompts as positive pairs, providing additional information during training: it not only adds diversity but also tells the vision system which images are alike and which are different. Remarkably, StableRep outperformed top-tier models trained on real images, such as SimCLR and CLIP, on large-scale datasets.
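One way to read "multi-positive contrastive learning" is as a cross-entropy between a target distribution that is uniform over the other images from the same prompt and a softmax over similarities to every other image in the batch. The sketch below implements that reading in plain Python on toy 2-D embeddings. It is not the authors' code; the function name, the temperature value, and the embeddings are assumptions.

```python
import math
from typing import List, Sequence


def multi_positive_contrastive_loss(
    embeddings: Sequence[Sequence[float]],
    prompt_ids: Sequence[int],
    temperature: float = 0.1,
) -> float:
    """Mean cross-entropy between a uniform target over same-prompt images and a similarity softmax."""
    # L2-normalize so dot products are cosine similarities (assumes non-zero vectors).
    unit: List[List[float]] = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        unit.append([x / norm for x in vec])

    losses = []
    for i, anchor in enumerate(unit):
        others = [j for j in range(len(unit)) if j != i]
        positives = [j for j in others if prompt_ids[j] == prompt_ids[i]]
        if not positives:
            continue  # an image with no sibling from the same prompt contributes nothing
        logits = {j: sum(a * b for a, b in zip(anchor, unit[j])) / temperature for j in others}
        peak = max(logits.values())
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        losses.append(-sum(logits[j] - log_norm for j in positives) / len(positives))
    if not losses:
        raise ValueError("need at least two images that share a prompt")
    return sum(losses) / len(losses)


# Four toy embeddings: the first two are close together, as are the last two.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = multi_positive_contrastive_loss(emb, prompt_ids=[0, 0, 1, 1])
shuffled = multi_positive_contrastive_loss(emb, prompt_ids=[0, 1, 0, 1])
print(aligned, shuffled)
```

The loss is low when images from the same prompt already sit close together and high when the grouping contradicts the embeddings, which is the signal that pulls same-prompt images toward each other during training.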

"While StableRep helps mitigate the challenges of data acquisition in machine learning, it also ushers in a stride towards a new era of AI training techniques. The capacity to produce high-caliber, diverse synthetic images on command could help curtail cumbersome expenses and resources," says Fan.

The process of data collection has never been straightforward. In the 1990s, researchers had to manually capture photographs to assemble datasets for objects and faces. The 2000s saw individuals scouring the internet for data. However, this raw, uncurated data often contained discrepancies when compared to real-world scenarios and reflected societal biases, presenting a distorted view of reality.

The task of cleansing datasets through human intervention is not only expensive, but also exceedingly challenging. Imagine, though, if this arduous data collection could be distilled down to something as simple as issuing a command in natural language.

A pivotal aspect of StableRep's triumph is the adjustment of the "guidance scale" in the generative model, which ensures a delicate balance between the synthetic images' diversity and fidelity. When finely tuned, synthetic images used in training these self-supervised models were found to be as effective, if not more so, than real images.
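For readers unfamiliar with the "guidance scale": in classifier-free guidance, each denoising step combines an unconditional noise prediction with a text-conditioned one, and the scale controls how far the result is pushed toward the text-conditioned direction. A minimal sketch of that combination step on plain lists (the function name and the numbers are illustrative, and the diffusion model that would produce the two predictions is left out):

```python
from typing import List


def apply_guidance(uncond: List[float], cond: List[float], scale: float) -> List[float]:
    """Classifier-free guidance: uncond + scale * (cond - uncond), element by element."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]  # stand-in for the unconditional noise prediction
cond = [1.0, 3.0]    # stand-in for the text-conditioned noise prediction

print(apply_guidance(uncond, cond, 0.0))  # ignores the prompt
print(apply_guidance(uncond, cond, 1.0))  # plain conditional prediction
print(apply_guidance(uncond, cond, 2.0))  # extrapolates past the conditional prediction
```

Larger scales follow the prompt more closely at the cost of variety across samples, which is the diversity versus fidelity balance the article refers to.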

Taking it a step further, language supervision was added to the mix, creating an enhanced variant: StableRep+. When trained with 20 million synthetic images, StableRep+ not only achieved superior accuracy but also displayed remarkable efficiency compared to CLIP models trained with a staggering 50 million real images.

Yet, the path ahead isn't without its potholes. The researchers candidly address several limitations, including the current slow pace of image generation, semantic mismatches between text prompts and the resultant images, potential amplification of biases, and complexities in image attribution, all of which are imperative to address for future advancements.

Another issue is that StableRep requires first training the generative model on large-scale real data. The team acknowledges that starting with real data remains a necessity; however, when you have a good generative model, you can repurpose it for new tasks, like training recognition models and visual representations.

While StableRep offers a good solution by diminishing the dependency on vast real-image collections, it brings to the fore concerns regarding hidden biases within the uncurated data used for these text-to-image models. The choice of text prompts, integral to the image synthesis process, is not entirely free from bias, "indicating the essential role of meticulous text selection or possible human curation," says Fan.

"Using the latest text-to-image models, we've gained unprecedented control over image generation, allowing for a diverse range of visuals from a single text input. This surpasses real-world image collection in efficiency and versatility. It proves especially useful in specialized tasks, like balancing image variety in long-tail recognition, presenting a practical supplement to using real images for training," says Fan.

"Our work signifies a step forward in visual learning, towards the goal of offering cost-effective training alternatives while highlighting the need for ongoing improvements in data quality and synthesis."

"One dream of generative model learning has long been to be able to generate data useful for discriminative model training," says Google DeepMind researcher and University of Toronto professor of computer science David Fleet, who was not involved in the paper.

"While we have seen some signs of life, the dream has been elusive, especially on large-scale complex domains like high-resolution images. This paper provides compelling evidence, for the first time to my knowledge, that the dream is becoming a reality. They show that contrastive learning from massive amounts of synthetic image data can produce representations that outperform those learned from real data at scale, with the potential to improve myriad downstream vision tasks."

More information: Yonglong Tian et al, StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners, arXiv (2023). DOI: 10.48550/arxiv.2306.00984

Journal information: arXiv

Provided by Massachusetts Institute of Technology



[Submitted on 1 Jun 2023 (v1), last revised 26 Oct 2023 (this version, v2)]

StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners

Yonglong Tian, Lijie Fan, Phillip Isola, Huiwen Chang, Dilip Krishnan
We investigate the potential of learning visual representations using synthetic images generated by text-to-image models. This is a natural question in the light of the excellent performance of such models in generating high-quality images. We consider specifically the Stable Diffusion, one of the leading open source text-to-image models. We show that (1) when the generative model is configured with proper classifier-free guidance scale, training self-supervised methods on synthetic images can match or beat the real image counterpart; (2) by treating the multiple images generated from the same text prompt as positives for each other, we develop a multi-positive contrastive learning method, which we call StableRep. With solely synthetic images, the representations learned by StableRep surpass the performance of representations learned by SimCLR and CLIP using the same set of text prompts and corresponding real images, on large scale datasets. When we further add language supervision, StableRep trained with 20M synthetic images achieves better accuracy than CLIP trained with 50M real images.
Comments: code is available at: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2306.00984 [cs.CV]
(or arXiv:2306.00984v2 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2306.00984

Submission history

From: Yonglong Tian
[v1] Thu, 1 Jun 2023 17:59:51 UTC (5,106 KB)
[v2] Thu, 26 Oct 2023 15:16:57 UTC (5,109 KB)
 