The A.I Megathread (LLM , GPT , Development)

bnew · Jul 10, 2024

Google opens up Gemini 1.5 Flash, Pro with 2M tokens to the public

Google is making its two recent Gemini model variations generally available after debuting them at Google I/O 2024. Its powerful small multimodal model and fastest Pro model with larger context window is now available.

venturebeat.com

Google opens up Gemini 1.5 Flash, Pro with 2M tokens to the public

Ken Yeung @thekenyeung

June 27, 2024 6:00 AM

Sir Demis Hassabis introduces Gemini 1.5 Flash. Image credit: Screenshot

Google Cloud is making two variations of its flagship AI model—Gemini 1.5 Flash and Pro—publicly accessible. The former is a small multimodal model with a 1 million context window that tackles narrow high-frequency tasks. It was first introduced in May at Google I/O. The latter, the most powerful version of Google’s LLM, debuted in February before being notably upgraded to contain a 2 million context window. That version is now open to all developers.

The release of these Gemini variations aims to showcase how Google’s AI work empowers businesses to develop “compelling” AI agents and solutions. During a press briefing, Google Cloud Chief Executive Thomas Kurian boasts the company sees “incredible momentum” with its generative AI efforts, with organizations such as Accenture, Airbus, Anthropic, Box, Broadcom, Cognizant, Confluent, Databricks, Deloitte, Equifax, Estée Lauder Companies, Ford, GitLab, GM, the Golden State Warriors, Goldman Sachs, Hugging Face, IHG Hotels and Resorts, Lufthansa Group, Moody’s, Samsung, and others building on its platform. He attributes this adoption growth to the combination of what Google’s models are capable of and the company’s Vertex platform. It’ll “continue to introduce new capability in both those layers at a rapid pace.”

Google is also releasing context caching and provisioned throughput, new model capabilities designed to enhance the developer experience.

Gemini 1.5 Flash

Gemini 1.5 Flash offers developers lower latency, affordable pricing and a context window suitable for inclusion in retail chat agents, document processing, and bots that can synthesize entire repositories. Google claims, on average, that Gemini 1.5 Flash is 40 percent faster than GPT-3.5 Turbo when given an input of 10,000 characters. It has an input price four times lower than OpenAI’s model, with context caching enabled for inputs larger than 32,000 characters.

Gemini 1.5 Pro

As for Gemini 1.5 Pro, developers will be excited to have a much larger context window. With 2 million tokens, it’s in a class of its own, as none of the prominent AI models has as high of a limit. This means this model can process and consider more text before generating a response than ever before. “You may ask, ‘translate that for me in real terms,'” Kurian states. “Two million context windows says you can take two hours of high-definition video, feed it into the model, and have the model understand it as one thing. You don’t have to break it into chunks. You can feed it as one thing. You can do almost a whole day of audio, one or two hours of video, greater than 60,000 lines of code and over 1.5 million words. And we are seeing many companies find enormous value in this.”

Kurian explains the differences between Gemini 1.5 Flash and Pro: “It’s not just the kind of customers, but it’s the specific [use] cases within a customer.” He references Google’s I/O keynote as a practical and recent example. “If you wanted to take the entire keynote—not the short version, but the two-hour keynote—and you wanted all of it processed as one video, you would use [Gemini 1.5] Pro because it was a two-hour video. If you wanted to do something that’s super low latency…then you will use Flash because it is designed to be a faster model, more predictable latency, and is able to reason up to a million tokens.”

Context caching now for Gemini 1.5 Pro and Flash

To help developers leverage Gemini’s different context windows, Google is launching context caching in public preview for both Gemini 1.5 Pro and Flash. Context caching allows models to store and reuse information they already have without recomputing everything from scratch whenever they receive a request. It’s helpful for long conversations or documents and lowers developers’ compute costs. Google reveals that context caching can reduce input costs by a staggering 75 percent. This feature will become more critical as context windows increase.

Provisioned throughput for Gemini

With provisioned throughput, developers can better scale their usage of Google’s Gemini models. This feature determines how many queries or texts a model can process over time. Previously, developers were charged with a “pay-as-you-go model,” but now they have the option of provisioned throughput, which will give them better predictability and reliability when it comes to production workloads.

“Provision throughput allows us to essentially reserve inference capacity for customers,” Kurian shares. “But if they want to reserve a certain amount of capacity, for example, if they’re running a large event and they’re seeing a big ramp in users, as we’re seeing with some of our social media platform customers, that are able to reserve capacity at a time, so they don’t start seeing exceptions from a service-level point of view. And that’s a big step forward in assuring them when we take our models into general availability, or giving them an assurance on a service-level objective, both with regard to response time, as well as availability up-time.”

Provisioned throughput is generally available starting today with an allowlist.

bnew · Jul 10, 2024

bnew · Jul 10, 2024

OpenAI’s new “CriticGPT” model is trained to criticize GPT-4 outputs

Research model catches bugs in AI-generated code, improving human oversight of AI.

arstechnica.com

OpenAI’s new “CriticGPT” model is trained to criticize GPT-4 outputs

Research model catches bugs in AI-generated code, improving human oversight of AI.

BENJ EDWARDS - 6/27/2024, 3:40 PM

Enlarge / An illustration created by OpenAI.

OpenAI

84

On Thursday, OpenAI researchers unveiled CriticGPT, a new AI model designed to identify mistakes in code generated by ChatGPT. It aims to enhance the process of making AI systems behave in ways humans want (called "alignment") through Reinforcement Learning from Human Feedback (RLHF), which helps human reviewers make large language model (LLM) outputs more accurate.

Developing an automated critic

The development of CriticGPT involved training the model on a large number of inputs containing deliberately inserted mistakes. Human trainers were asked to modify code written by ChatGPT, introducing errors and then providing example feedback as if they had discovered these bugs. This process allowed the model to learn how to identify and critique various types of coding errors.

In experiments, CriticGPT demonstrated its ability to catch both inserted bugs and naturally occurring errors in ChatGPT's output. The new model's critiques were preferred by trainers over those generated by ChatGPT itself in 63 percent of cases involving natural bugs (the aforementioned statistic). This preference was partly due to CriticGPT producing fewer unhelpful "nitpicks" and generating fewer false positives, or hallucinated problems.

The researchers also created a new technique they call Force Sampling Beam Search (FSBS). This method helps CriticGPT write more detailed reviews of code. It lets the researchers adjust how thorough CriticGPT is in looking for problems while also controlling how often it might make up issues that don't really exist. They can tweak this balance depending on what they need for different AI training tasks.

Interestingly, the researchers found that CriticGPT's capabilities extend beyond just code review. In their experiments, they applied the model to a subset of ChatGPT training data that had previously been rated as flawless by human annotators. Surprisingly, CriticGPT identified errors in 24 percent of these cases—errors that were subsequently confirmed by human reviewers. OpenAI thinks this demonstrates the model's potential to generalize to non-code tasks and highlights its ability to catch subtle mistakes that even careful human evaluation might miss.

OpenAI's CriticGPT Catches Errors in Code Generated by ChatGPT

OpenAI recently published a paper about CriticGPT, a version of GPT-4 fine-tuned to critique code generated by ChatGPT. When compared with human evaluators, CriticGPT catches more bugs and produces better critiques. OpenAI plans to use CriticGPT to improve future versions of their models.

www.infoq.com

OpenAI's CriticGPT Catches Errors in Code Generated by ChatGPT

JUL 09, 2024 2 MIN READ

by Anthony Alford

Senior Director, Development at Genesys Cloud Services

OpenAI recently published a paper about CriticGPT, a version of GPT-4 fine-tuned to critique code generated by ChatGPT. When compared with human evaluators, CriticGPT catches more bugs and produces better critiques. OpenAI plans to use CriticGPT to improve future versions of their models.

When originally developing ChatGPT, OpenAI used human "AI trainers" to rate the outputs of the model, creating a dataset that was used to fine-tune it using reinforcement learning from human feedback (RLHF). However, as AI models improve, and can now perform some tasks at the same level as human experts, it can be difficult for human judges to evaluate their output. CriticGPT is part of OpenAI's effort on scalable oversight, which is intended to help solve this problem. OpenAI decided first to focus on helping ChatGPT improve its code-generating abilities. The researchers used CriticGPT to generate critiques of code; they also paid qualified human coders to do the same. In evaluations, AI trainers preferred CriticGPT's critiques 80% of the time, showing that CriticGPT could be a good source for RLHF training data. According to OpenAI:

The need for scalable oversight, broadly construed as methods that can help humans to correctly evaluate model output, is stronger than ever. Whether or not RLHF maintains its dominant status as the primary means by which LLMs are post-trained into useful assistants, we will still need to answer the question of whether particular model outputs are trustworthy. Here we take a very direct approach: training models that help humans to evaluate models....It is...essential to find scalable methods that ensure that we reward the right behaviors in our AI systems even as they become much smarter than us. We find LLM critics to be a promising start.

Interestly, CriticGPT is also a version of GPT-4 that is fine-tuned with RLHF. In this case, the RLHF training data consisted of buggy code as the input, and a human-generated critique or explanation of the bug as the desired output. The buggy code was produced by having ChatGPT write code, then having a human contractor insert a bug and write the critique.

To evaluate CriticGPT, OpenAI used human judges to rank several critiques side-by-side; judges were shown outputs from CriticGPT and from baseline ChatGPT, as well as critiques generated by humans alone or by humans with CriticGPT assistance ("Human+CriticGPT"). The judges preferred CriticGPT's output over that of ChatGPT and human critics. OpenAI also found that the Human+CriticGPT teams' output was "substantially more comprehensive" than that of humans alone. However, it tended to have more "nitpicks."

In a discussion about the work on Hacker News, one user wrote:

For those new to the field of AGI safety: this is an implementation of Paul Christiano's alignment procedure proposal called Iterated Amplification from 6 years ago...It's wonderful to see his idea coming to fruition! I'm honestly a bit skeptical of the idea myself (it's like proposing to stabilize the stack of "turtles all the way down" by adding more turtles)...but every innovative idea is worth a try, in a field as time-critical and urgent as AGI safety.

Christiano formerly ran OpenAI's language model alignment team. Other companies besides OpenAI are also working on scalable oversight. In particular, Anthropic has published research papers on the problem, such as their work on using a debate between LLMs to improve model truthfulness.

bnew · Jul 10, 2024

1/1
Last week we released Meta Chameleon: a new mixed-modal research model from Meta FAIR.

Get the models Chameleon access request form - Meta AI

The 7B & 34B safety tuned models we’ve released can take any combination of text and images as input and produce text outputs using a new early fusion approach. While some LLMs have separate image and text encoders or decoders, Chameleon is one of the first publicly released approaches using a single unified architecture.

We’re releasing Chameleon models under a research license to help democratize access to foundational mixed-modal models & further research on early fusion.

Approach & training details in the paper [2405.09818] Chameleon: Mixed-Modal Early-Fusion Foundation Models

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

Meta's Chameleon AI Model Outperforms GPT-4 on Mixed Image-Text Tasks

The Fundamental AI Research (FAIR) team at Meta recently released Chameleon, a mixed-modal AI model that can understand and generate mixed text and image content. In experiments rated by human judges, Chameleon's generated output was preferred over GPT-4 in 51.6% of trials, and over Gemini Pro...

www.infoq.com

Meta's Chameleon AI Model Outperforms GPT-4 on Mixed Image-Text Tasks

JUN 25, 2024 2 MIN READ

by Anthony Alford

Senior Director, Development at Genesys Cloud ServicesFOLLOW

The Fundamental AI Research (FAIR) team at Meta recently released Chameleon, a mixed-modal AI model that can understand and generate mixed text and image content. In experiments rated by human judges, Chameleon's generated output was preferred over GPT-4 in 51.6% of trials, and over Gemini Pro in 60.4%.

Chameleon differs from many mixed image-text AI models, which use separate encoders and decoders for the two modalities. Instead, Chameleon uses a single token-based representation of both text and image, and is trained end-to-end on mixed sequences of both image and text. Meta trained two sizes of the model: Chameleon-7B, with seven billion parameters, and Chameleon-34B with 34 billion. Both models were pre-trained on over four trillion tokens of mixed text and image data, and then fine-tuned for alignment and safety on smaller datasets. In addition to human judges comparing the model's output to that of baseline models, Meta evaluated Chameleon-34B on several benchmarks, noting that Chameleon achieved state-of-the-art performance on visual question answering and image captioning tasks. According to Meta,

By quantizing images into discrete tokens and training on mixed-modal data from scratch, Chameleon learns to jointly reason over image and text in a way that is impossible with...models that maintain separate encoders for each modality. At the same time, Chameleon introduces novel techniques for stable and scalable training of early-fusion models, addressing key optimization and architectural design challenges that have previously limited the scale of such approaches...Chameleon also unlocks entirely new possibilities for multimodal interaction, as demonstrated by its strong performance on our new benchmark for mixed-modal open-ended QA.

The Meta team noted that it was "challenging" to train Chameleon when scaling above 8B model parameters or more than 1T dataset tokens, due to instabilities. The researchers had to make changes to the standard Transformer architecture to resolve these challenges; in particular, they found that because the model weights were shared across both input modalities, "each modality will try to compete with the other" by increasing its vector norms, eventually moving the norms outside the range of the 16-bit floating point representation used in the model. The solution was to apply additional normalization operations into the model architecture.

Chameleon Training and Inference. Image Source: Meta Research Paper

Autoregressive output generation with Chameleon also presented "unique" performance challenges. First, generating mixed-mode output requires a different decoding strategy for each mode, so output tokens must be copied from GPU to CPU for program control flow. Also, since the model can be asked to generate single-mode output (for example, text-only), this requires the model to have the ability to mask or ignore tokens of other modalities. To solve these problems, Meta implemented a custom inference pipeline for the model.

In a thread on X about the research, Chameleon co-author Armen Aghajanyan wrote:

One core learning we had with Chameleon is that the intended form of the modality is a modality in itself. Visual Perception and Visual Generation are two separate modalities and must be treated as such; hence, using discretized tokens for perception is wrong.

In another thread, AI researcher Nando de Freitas noted that Chameleon's architecture is similar to that of DeepMind's Gato model and wondered, "Is this going to be the ultimate approach for MIMO (multimodal input multimodal output models) or is there something else we should be trying?"

While Meta did not publicly release Chameleon, citing safety concerns, they released a modified version of the models that support mixed-mode inputs but cannot generate image output. These models are available on request for use under a "research-only license."

[Submitted on 16 May 2024]

Chameleon - Mixed-Modal Early-Fusion Foundation Models

Chameleon Team

Abstract:We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed modal generation. Chameleon demonstrates broad and general capabilities, including state-of-the-art performance in image captioning tasks, outperforms Llama-2 in text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in a unified modeling of full multimodal documents.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2405.09818
	arXiv:2405.09818v1
	[2405.09818] Chameleon: Mixed-Modal Early-Fusion Foundation Models

Submission history

From: [v1] [ view email]
[v1]

https://arxiv.org/pdf/2405.09818

bnew · Jul 10, 2024

Meta Chameleon 7B & 34B mixed-modal language models

eastwind/meta-chameleon-7b · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

eastwind/meta-chameleon-30b · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

bnew · Jul 10, 2024

Maximizing the Utility of Large Language Models (LLMs) through Prompting

In this article, authors Numa Dhamani and Maggie Engler discuss how prompt engineering techniques can help use the large language models (LLMs) more effectively to achieve better results. Prompting techniques discussed include few-shot, chain-of-thought, self-consistency, and tree-of-thoughts...

www.infoq.com

Maximizing the Utility of Large Language Models (LLMs) through Prompting

JAN 19, 2024 14 MIN READ

by Numa Dhamani

Principal Machine Learning Engineer @Kungfu

Maggie Engler

Engineer & Researcher on LLMs @Inflection

Key Takeaways

Prompt engineering is about experimenting with changes in prompts to understand their impacts on what large language models (LLMs) generate as the output. Prompt engineering yields better outcomes for LLM use with a few basic techniques
Zero-shot prompting is when an LLM is given a task, via prompt, for which the model has not previously seen data
For the language tasks in the literature, performance improves with a few examples, this is known as few-shot prompting
Chain-of-Thought (CoT) prompting breaks down multi-step problems into intermediate steps allowing LLMs to tackle complex reasoning that can't be solved with zero-shot or few-shot prompting
Built upon CoT, self-consistency prompting is an advanced prompting technique, that provides the LLM with multiple, diverse reasoning paths and then selects the most consistent answer among the generated responses

A new job title, "prompt engineer," has made waves in tech media recently, bursting onto the careers pages of top AI companies with promises of eye-wateringly high salaries. But what even is prompt engineering? The term itself was coined only within the past few years and refers to the art and science of prompting large language models (LLMs) to achieve better results.

These LLMs, which include the popular ChatGPT and GPT4 from OpenAI, Claude from Anthropic, and Bard from Google, can be used for all types of language tasks, from drafting emails and documents to summarizing or classifying text.

Prompting, essentially, refers to the practice of giving inputs to the LLMs, typically a set of instructions describing how to perform the desired task. For example, to use ChatGPT to draft a cover letter based on a job description, one might use the prompt,

"You are an applicant applying for the following job. Write a cover letter explaining why you are well-suited for the role."

Easy enough, right? Here's where it gets tricky: researchers have found that the results provided by LLMs depend heavily on the specific prompt given, and so even though it might seem trivial to explain a well-defined task like writing a cover letter, simple adjustments like wording and formatting changes can substantially affect the output received.

Prompt engineering is fundamentally experimenting with changes in prompts to understand their impacts on what the model generates, and therefore doesn't require an advanced technical background - merely a little curiosity and creativity.

Furthermore, prompt engineering is not just the purview of highly paid programmers. We argue that everyone who uses LLMs can, and should, be a prompt engineer. The most basic reason is that prompt engineering will yield better outcomes for LLM use, and with even a few basic techniques, it's possible to measurably improve performance on many common tasks.

Few-Shot Prompting

Let's consider a task for which someone might want to employ an LLM: classifying customer reviews by sentiment. When an LLM is given a task, via prompt, for which the model has not previously seen data, this is referred to as zero-shot prompting. In this case, the task would be a zero-shot classification task. The zero-shot prompt might be something like the figure shown below, where the user gives instructions on classifying movie reviews to Google's Bard.

This might work reasonably well on its own, but across a variety of language tasks in the literature, performance nearly always improves with a few examples, or a few-shot prompt. In this case, the few-shot prompt might start the same way, but append several examples of the expected output. Again, we see that Bard generates a response of "POSITIVE" for the final unlabeled review.

More precisely, this prompt would be two-shot, referring to the fact that there are two examples provided to the model. The fact that providing examples in this way measurably improves the model's ability to complete the task is unique to LLMs; as a result, unlike other machine learning models, LLMs are able to perform a wide variety of tasks that they are not trained to do.

However, every approach comes with its own limitations. While few-shot prompting is an effective technique for numerous tasks, it tends to fall short when addressing more complex reasoning tasks. Let’s look at an example. In the figure below, we provided ChatGPT with a multi-step reasoning task: do the odd numbers in the given group add up to an even number?

Even though we used few-shot prompting to provide the LLM with multiple examples, it did not give us the correct answer. To solve this task, we need to first determine the odd numbers. Then, we need to add the odd numbers together, and finally, we need to determine if the resulting number is odd or even. That is a multi-step reasoning task, so it might help if we break the problem down into multiple steps and demonstrate it to the model.

Chain-of-Thought Prompting

This is where Chain-of-Thought (CoT) prompting comes in. Introduced by Google researchers, CoT prompting breaks down multi-step problems into intermediate steps allowing LLMs to tackle complex reasoning that can’t be solved with zero-shot or few-shot prompting. Here, instead of directly asking the LLM to solve the problem, we would use CoT prompting to guide the model through the intermediate reasoning tasks. Let’s revisit the previous example.

By providing the intermediate reasoning step, ChatGPT is able to get the correct answer with only one example. This technique can be extremely beneficial for complex multi-step tasks - if the target task involves multiple steps or sequential reasoning, CoT can help break down the problem into smaller and more manageable tasks. A more recent idea is zero-shot CoT, where you ask the LLM to think through the intermediate reasoning tasks itself by adding "let’s think step by step" to the original prompt. Let’s demonstrate this with our example again.

Without any examples, the LLM is able to correctly perform the multi-step reasoning task with a simple prompt, which it was previously unable to do with few-shot prompting. CoT prompting is a useful technique for multi-step reasoning problems, tasks that benefit from intermediate explanations, or simply, tasks for which standard prompting techniques are insufficient.

Self-Consistency Prompting

Built upon CoT, self-consistency prompting is an advanced prompting technique, also introduced by Google researchers. This approach aims to provide the LLM with multiple, diverse reasoning paths and then selects the most consistent answer among the generated responses. Self-consistency prompting is more powerful when used in conjunction with CoT and boosts its performance with complex reasoning tasks. Let’s look at an example from the research paper demonstrating how the self-consistency approach works. First, you prompt the LLM using CoT prompting. Then, you generate a diverse set of responses, and finally, you select the most consistent response as your answer. By utilizing a majority voting system, the model is more likely to arrive at an accurate and reliable answer.

The three steps in the self-consistency method: 1) prompt the language model using CoT prompting, 2) sample a diverse set of reasoning paths from the language model, and 3) marginalize out reasoning paths to aggregate final answers and choose the most consistent answer. ( source)

Let’s look at another example of analyzing emails. Suppose that a software consulting company receives hundreds of emails every day, but often, most of them are promotional content or spam. They want to classify emails as IMPORTANT or NOT IMPORTANT.

Here is an example email that the company might receive:

Hello!

I trust this email finds you well. We are currently facing a critical challenge within our organization and believe your expertise could provide valuable insights. Given your successful track record in software engineering, we believe your team's insights and recommendations could be instrumental in resolving this matter.

Could we please schedule a meeting at your earliest convenience to discuss the details and explore how we might collaborate to address this challenge?

Your prompt attention to this matter is greatly appreciated.

Best regards,

John Doe

Now, we can put this email into a prompt and ask the LLM to "Classify the above email as IMPORTANT or NOT IMPORTANT as it relates to a software consulting company. Let’s think step by step." Then, we can use this prompt to generate multiple responses and then select the most consistent response.

RESPONSE: The email explicitly mentions facing a critical challenge within the organization, which signifies the importance of the matter. I would classify this email as IMPORTANT.

RESPONSE: The email mentions facing a critical challenge within the organization, but they have not described the issue or challenge. I would classify this email as NOT IMPORTANT.

RESPONSE: While not explicitly stated, the tone of the email and the request for prompt attention suggest a sense of urgency, emphasizing the importance of a timely response. I would classify this email as IMPORTANT.

By exploring multiple paths, the LLM is likely to arrive at a correct or well-informed response. Additionally, it could potentially help counteract biases that may be inherent in the model or the data it is trained on. Although self-consistency prompting has shown an improvement in results when CoT alone has been ineffective, it falls short with tasks that require exploration or strategic lookahead.

bnew · Jul 10, 2024

Tree-of-Thoughts Prompting

Proposed by Yao et. al (2023) and Long (2023), Tree-of-Thoughts (ToT) is a prompting framework that extends beyond CoT encouraging exploration of thoughts that function as intermediate steps for solving problems more broadly with LLMs. This technique asks the LLM to consider multiple solutions at each level, and the model is encouraged to continually evaluate its results, normalizing its decision-making process, and boosting its confidence in the chosen solution. In other words, it leverages dynamic decision-making by generating intermediate steps and potential solutions, which are then evaluated to determine whether they're on the right path or not.

For example, if the task is to create a business strategy, the LLM would first generate multiple potential initial steps for the strategy, Then, when the initial thoughts are generated, you can task the model with self-critiquing each of these thoughts with respect to the input prompt. Here, the LLM would evaluate how well each thought or step aligns with the problem-solving objective. This assessment phase can involve ranking each thought or if appropriate, assigning scores. Then, the thoughts that are evaluated as less useful or suitable can be discarded, and you can expand upon the remaining thoughts. In this framework, you continue a similar process of self-critique and ranking until you reach a final decision. This technique allows the LLM to evaluate and pursue multiple paths simultaneously.

Here is a step-by-step process to leverage a simplified version of the ToT framework:

Phase 1: Brainstorming - Ask the LLM to generate three or more options while considering various factors.
Phase 2: Evaluation - Ask the LLM to objectively assess each option’s potential success by evaluating their pros and cons.
Phase 3: Expansion - Ask the LLM to go deeper into suitable ideas, refine them, and imagine their implications in real-world contexts.
Phase 4: Decision - Ask the LLM to rank or score each solution based on the evaluations and scenarios generated.

The ToT framework has greatly improved performance over CoT on problems requiring search-like games, crossword puzzles, and even creative writing. However, it requires multiple prompts and generations before reaching a final answer. More recently, Hulbert introduced zero-shot Tree-of-Thought prompting which applies the fundamental concept from the ToT framework as a single prompt. His prompt is:

"Imagine three different experts are answering this question. All experts will write down 1 step of their thinking, then share it with the group. Then all experts will go on to the next step, etc. If any expert realises they're wrong at any point then they leave. The question is ..."

Let’s look at Hulbert’s original example below with a response from ChatGPT.

INPUT: Imagine three different experts are answering this question. All experts will write down 1 step of their thinking, then share it with the group. Then all experts will go on to the next step, etc. If any expert realizes they're wrong at any point then they leave. The question is...

Bob is in the living room. He walks to the kitchen, carrying a cup. He puts a ball in the cup and carries the cup to the bedroom. He turns the cup upside down, then walks to the garden. He puts the cup down in the garden, then walks to the garage. Where is the ball?

RESPONSE:

Expert 1: The ball is in the living room.

Expert 2: The ball is in the kitchen.

Expert 3: The ball is in the bedroom.

Expert 1: Bob carries the cup to the bedroom, so the ball must be in the cup.

Expert 2: Oh, I see my mistake. Yes, the ball is in the cup.

Expert 3: Agreed, the ball is in the cup in the bedroom.

Expert 1: Next, Bob turns the cup upside down in the bedroom.

Expert 2: After that, Bob walks to the garden and puts the cup down.

Expert 3: Therefore, the ball must have fallen out of the cup when Bob turned it upside down in the bedroom. So, the ball is in the bedroom, not in the cup anymore.

Expert 1: Oh, you're right. I made a mistake. The ball is in the bedroom, not in the cup.

Expert 2: Agreed, the ball is in the bedroom.

Expert 3: Bob then walks to the garage, so the ball remains in the bedroom. It is not in the garden or the garage.

Expert 1: Absolutely, the ball is still in the bedroom.

Expert 2: Yes, the ball hasn't moved from the bedroom.

Expert 3: Therefore, the ball is in the bedroom, not in the garden or the garage.

All three experts agree that the ball is in the bedroom.

Inspired by Hulbert’s original prompt, here are additional variants of zero-shot Tree-of-Thought prompts to try:

"Simulate three brilliant, logical experts collaboratively answering a question. Each one verbosely explains their thought process in real-time, considering the prior explanations of others and openly acknowledging mistakes. At each step, whenever possible, each expert refines and builds upon the thoughts of others, acknowledging their contributions. They continue until there is a definitive answer to the question. For clarity, your entire response should be in a markdown table. The question is ..."
"Imagine three highly intelligent experts working together to answer a question. They will follow a tree of thoughts approach, where each expert shares their thought process step by step. They will consider the input from others, refine their thoughts, and build upon the group's collective knowledge. If an expert realizes their thought is incorrect, they will acknowledge it and withdraw from the discussion. Continue this process until a definitive answer is reached. Present the entire response in a markdown table. The question is ..."
"Three experts with exceptional logical thinking skills are collaboratively answering a question using a tree of thoughts method. Each expert will share their thought process in detail, taking into account the previous thoughts of others and admitting any errors. They will iteratively refine and expand upon each other's ideas, giving credit where it's due. The process continues until a conclusive answer is found. Organize the entire response in a markdown table format. The question is ..."
"Envision a group of three experts working in unison to tackle a question by employing a tree of thoughts strategy. Each expert will thoroughly explain their line of thinking at every step, while also considering the insights provided by their peers. They will openly recognize any mistakes and build upon the group's shared understanding. This iterative process will continue until a definitive solution is reached. Structure the entire response as a markdown table. The question is ..."

We can apply this general strategy to any number of decisions that might be encountered in real-world scenarios as a means of quickly generating several perspectives on an issue. Continuing the hypothetical conversation through prompts might or might not result in a consensus among the experts, but is likely to yield insights about the tradeoffs involved in the decision.

INPUT: Imagine three different experts are arguing about how to design a file-sharing system. The system should scale to millions of users and handle petabytes of data. Each expert should have a different idea for the system architecture, and they should discuss the tradeoffs of each approach.

RESPONSE:

Expert 1: Propose a Decentralized Peer-to-Peer Architecture

In my opinion, a decentralized peer-to-peer (P2P) architecture is the way to go. Users share files directly with each other, eliminating the need for a central server to handle all requests. This distributes the load across the network, making it more scalable. However, the challenge lies in managing security, ensuring data integrity, and handling dynamic network conditions. Also, it may be challenging to implement complex features like access control and versioning.

Expert 2: Argue for a Distributed Cloud-Based Architecture

I advocate for a distributed cloud-based architecture. Leverage cloud storage services and distribute data across multiple servers in different geographical regions. This approach ensures high availability, scalability, and easier management of petabytes of data. The cloud's infrastructure can handle the heavy lifting, but concerns may arise regarding data privacy, potential costs, and reliance on a third-party service. Additionally, careful consideration must be given to data sharding and load balancing to ensure optimal performance.

Expert 3: Argue for a Hybrid Architecture

I propose a hybrid architecture that combines the strengths of both decentralized P2P and distributed cloud-based systems. Use P2P for user-to-user file sharing to offload some of the traffic from centralized servers. Simultaneously, leverage cloud storage for redundancy, backup, and to ensure data availability across the network. This hybrid model aims to balance the advantages of decentralization with the reliability and scalability offered by cloud services. However, managing the synchronization between these two components introduces complexity and potential challenges in maintaining consistency.

Conclusion

In this article, we have covered a few of the latest prompt engineering techniques from academic papers. While these techniques are empirically proven, there is still a lot that we don't know about exactly why they work. What this means is that the best way to become a better prompt engineer is by prompt engineering - trying out various prompts, observing their results, tweaking and iterating on them, rinse, and repeat.

This exercise is worthwhile not only for professional prompt engineers, but anyone who wants to integrate LLMs into their workflows. When done properly, prompting can be used to alter the style and format, improve clarity or accuracy, and ensure consistency of the AI-generated responses. Prompt engineering is an overlooked but incredibly powerful tool for taking advantage of the transformative technology of LLMs.

bnew · Jul 10, 2024

Meta Open-Sources MEGALODON LLM for Efficient Long Sequence Modeling

Researchers from Meta, University of Southern California, Carnegie Mellon University, and University of California San Diego recently open-sourced MEGALODON, a large language model (LLM) with an unlimited context length. MEGALODON has linear computational complexity and outperforms a...

www.infoq.com

Meta Open-Sources MEGALODON LLM for Efficient Long Sequence Modeling

LIKE

JUN 11, 2024 2 MIN READ

by m Anthony Alford

Senior Director, Development at Genesys Cloud Services

Researchers from Meta, University of Southern California, Carnegie Mellon University, and University of California San Diego recently open-sourced MEGALODON, a large language model (LLM) with an unlimited context length. MEGALODON has linear computational complexity and outperforms a similarly-sized Llama 2 model on a range of benchmarks.

MEGALODON is designed to address several shortcomings of the Transformer neural architecture underlying most LLMs. Instead of the standard multihead attention, MEGALODON uses a chunk-wise attention. The research team also introduced sequence-based parallelism during training, improving scalability for long-context training. When evaluated on standard LLM benchmarks, such as WinoGrande and MMLU, MEGALODON outperformed a Llama 2 model with the same amount of parameters, training data, and training compute budget. According to the researchers:

MEGALODON achieves impressive improvements on both training perplexity and across downstream benchmarks. Importantly, experimental results on long-context modeling demonstrate MEGALODON’s ability to model sequences of unlimited length. Additional experiments on small/medium-scale benchmarks across different data modalities illustrate the robust improvements of MEGALODON, which lead to a potential direction of future work to apply MEGALODON for large-scale multi-modality pretraining.

Although the Transformer architecture has become the standard for most Generative AI models, Transformers do have some drawbacks. In particular, their self-attention mechanism has quadratic complexity in both compute and storage, which limits the models' input context length. Several alternatives to the standard self-attention model have been developed recently, including structured state space models (SSMs) like Mamba, which scales linearly with context length. Another scheme that InfoQ recently covered is the RWKV Project's attention-free Transformer model, which has no maximum input context length.

MEGALODON builds on the research team's previous model, MEGA (exponential moving average with gated attention), with several new features. First, while MEGA uses a "classical" exponential moving average (EMA) within its attention mechanism, MEGALODON computes a complex EMA (CEMA). Mathematically, the CEMA component makes MEGALODON equivalent to a "simplified state space model with diagonal state matrix."

The research team trained a seven-billion parameter model, MEGALODON-7B, using the same 2-trillion token dataset that Llama2-7B used; they also used the same training hyperparameters. The team observed that MEGALODON-7B was more computationally efficient. When the Llama model was scaled up to a 32k context length, MEGALODON-7B was "significantly" faster.

Besides evaluating MEGALODON-7B on standard LLM benchmarks, the researchers also tested its performance on the SCROLLS long-context question-answering benchmark, and compared its results with several baseline models, including the modified Llama 2 model with a 32k context length. MEGALODON outperformed all baseline models on the NarrativeQA subtask, and on all tasks achieved results "competitive" with Llama 2.

In a discussion about MEGALODON on Hacker News, one user wondered how well the model performed on recall tasks, given that other non-Transformer models tend to perform poorly. Another user replied:

For what it's worth, RWKV's website on that matter mentions that yes it's bad on recall, but for the vast majority of tasks you can just ask the question *before* the content, and it'll handle the task just fine.

The MEGALODON code is available on GitHub.

bnew · Jul 10, 2024

Slack Combines ASTs with Large Language Models to Automatically Convert 80% of 15,000 Unit Tests

Slack's engineering team recently published how it used a large language model (LLM) to automatically convert 15,000 unit and integration tests from Enzyme to React Testing Library (RTL). By combining Abstract Syntax Tree (AST) transformations and AI-powered automation, Slack's innovative...

www.infoq.com

Slack Combines ASTs with Large Language Models to Automatically Convert 80% of 15,000 Unit Tests

JUN 11, 2024 2 MIN READ

by Eran Stiller

Principal Software Architect

Slack's engineering team recently published how it used a large language model (LLM) to automatically convert 15,000 unit and integration tests from Enzyme to React Testing Library (RTL). By combining Abstract Syntax Tree (AST) transformations and AI-powered automation, Slack's innovative approach resulted in an 80% conversion success rate, significantly reducing the manual effort required and showcasing the potential of AI in streamlining complex development tasks.

This transition was prompted by Enzyme's lack of support for React 18, necessitating a significant shift to maintain compatibility with the latest React version. The conversion tool's adoption rate at Slack reached approximately 64%, saving considerable developer time of at least 22% of 10,000 hours. While this figure represents a significant saving, Sergii Gorbachov, senior software engineer at Slack, speculates that, in reality, the figure is probably much higher:

It's important to note that this 22% time saving represents only the documented cases where the test case passed. However, it's conceivable that some test cases were converted properly, yet issues such as setup or importing syntax may have caused the test file not to run at all, and time savings were not accounted for in those instances.

The team initially attempted to automate the conversion using Abstract Syntax Tree (AST) transformations, aiming for 100% accuracy. However, the complexity and variety of Enzyme methods led to a modest success rate of 45% in automatically converting code. One factor contributing to the low success rate is that correct conversion depends on contextual information regarding the rendered Document Object Model (DOM) under test, to which the AST conversion has no access.

The AST representation of `wrapper.find('selector');` ( Source)

Next, the team attempted to perform the conversion using Anthropic's LLM, Claude 2.1. Despite efforts to refine prompts, the conversion success rates varied significantly between 40% and 60%. Gorbachov notes that "the outcomes ranged from remarkably effective conversions to disappointingly inadequate ones, depending largely on the complexity of the task."

Following the unsatisfactory results, the team decided to observe how human developers approached converting the unit tests. They noticed that human developers had access to a broad knowledge base on React, Enzyme and RTL, and they combined that knowledge with access to context on the rendered React element and the AST conversions provided by the initial version of the conversion tool.

Slack's engineers then adopted a hybrid approach, combining the AST transformations with LLM capabilities and mimicking human behaviour. By feeding the rendered React component under test and the conversions performed by the AST tool into the LLM as part of the prompt and creating a robust control mechanism for the AI, they achieved an 80% conversion success rate, demonstrating the complementary nature of these technologies.

The modified pipeline flowchart ( Source)

Claude 2.1 is an LLM model announced in November 2023 by Anthropic. It included a 200K token context window, significant reductions in rates of model hallucination, and system prompts and allowed for the use of tools. Anthropic has since introduced the Claude 3 family models consisting of three distinct models, multimodal capabilities, and improved contextual understanding.

An Abstract Syntax Tree (AST) is a tree representation of the abstract syntactic structure of source code written in a programming language. Each node in the tree denotes a construct occurring in the source code. A syntax tree focuses on the structure and content necessary for understanding the code's functionality. ASTs are commonly used in compilers and interpreters to parse and analyze code, enabling various transformations, optimizations, and translations during compilation.

bnew · Jul 10, 2024

https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

AUTHORS

Adly Templeton*, Tom Conerly*, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, Tom Henighan

AFFILIATIONS

Anthropic

PUBLISHED

May 21, 2024

* Core Contributor; Correspondence to henighan@anthropic.com; Author contributions statement below.

Eight months ago, we demonstrated that sparse autoencoders could recover monosemantic features from a small one-layer transformer. At the time, a major concern was that this method might not scale feasibly to state-of-the-art transformers and, as a result, be unable to practically contribute to AI safety. Since then, scaling sparse autoencoders has been a major priority of the Anthropic interpretability team, and we're pleased to report extracting high-quality features from Claude 3 Sonnet, <span data-hover-ref="dt-fn-hover-box-1">1</span> Anthropic's medium-sized production model.

We find a diversity of highly abstract features. They both respond to and behaviorally cause abstract behaviors. Examples of features we find include features for famous people, features for countries and cities, and features tracking type signatures in code. Many features are multilingual (responding to the same concept across languages) and multimodal (responding to the same concept in both text and images), as well as encompassing both abstract and concrete instantiations of the same idea (such as code with security vulnerabilities, and abstract discussion of security vulnerabilities).

Some of the features we find are of particular interest because they may be safety-relevant – that is, they are plausibly connected to a range of ways in which modern AI systems may cause harm. In particular, we find features related to security vulnerabilities and backdoors in code; bias (including both overt slurs, and more subtle biases); lying, deception, and power-seeking (including treacherous turns); sycophancy; and dangerous / criminal content (e.g., producing bioweapons). However, we caution not to read too much into the mere existence of such features: there's a difference (for example) between knowing about lies, being capable of lying, and actually lying in the real world. This research is also very preliminary. Further work will be needed to understand the implications of these potentially safety-relevant features.

KEY RESULTS

Sparse autoencoders produce interpretable features for large models.
Scaling laws can be used to guide the training of sparse autoencoders.
The resulting features are highly abstract: multilingual, multimodal, and generalizing between concrete and abstract references.
There appears to be a systematic relationship between the frequency of concepts and the dictionary size needed to resolve features for them.
Features can be used to steer large models (see e.g. Influence on Behavior). This extends prior work on steering models using other methods (see Related Work).
We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content.

Scaling Dictionary Learning to Claude 3 Sonnet

Our general approach to understanding Claude 3 Sonnet is based on the linear representation hypothesis (see e.g. [1]) and the superposition hypothesis (see e.g. [2, 3, 4]). For an introduction to these ideas, we refer readers to the Background and Motivation section of Toy Models [4]

. At a high level, the linear representation hypothesis suggests that neural networks represent meaningful concepts – referred to as features – as directions in their activation spaces. The superposition hypothesis accepts the idea of linear representations and further hypothesizes that neural networks use the existence of almost-orthogonal directions in high-dimensional spaces to represent more features than there are dimensions.

If one believes these hypotheses, the natural approach is to use a standard method called dictionary learning [5, 6]. Recently, several papers have suggested that this can be quite effective for transformer language models [7, 8, 9, 10]. In particular, a specific approximation of dictionary learning called a sparse autoencoder appears to be very effective [8, 9].

To date, these efforts have been on relatively small language models by the standards of modern foundation models. Our previous paper [8], which focused on a one-layer model, was a particularly extreme example of this. As a result, an important question has been left hanging: will these methods work for large models? Or is there some reason, whether pragmatic questions of engineering or more fundamental differences in how large models operate, that would mean these efforts can't generalize?

This context motivates our project of scaling sparse autoencoders to Claude 3 Sonnet, Anthropic's medium-scale production model. The rest of this section will review our general sparse autoencoder setup, the specifics of the three sparse autoencoders we'll analyze in this paper, and how we used scaling laws to make informed decisions about the design of our sparse autoencoders. From there, we'll dive into analyzing the features our sparse autoencoders learn – and the interesting properties of Claude 3 Sonnet they reveal.

READ THE REST ON THE SITE...

bnew · Jul 10, 2024

1/11
Yesterday we introduced Moshi, the lowest latency conversational AI ever released. Moshi can perform small talk, explain various concepts, engage in roleplay in many emotions and speaking styles. Talk to Moshi here moshi.chat and learn more about the method below

.

2/11
Moshi is an audio language model that can listen and speak continuously, with no need for explicitly modelling speaker turns or interruptions. When talking to Moshi, you will notice that the UI displays a transcript of its speech. This does *not* come from an ASR nor is an input to a TTS, but is rather part of the integrated multimodal modelling of Moshi. Here, listen to it whispering about quantum physics.

3/11
Moshi is not an assistant, but rather a prototype for advancing real-time interaction with machines. It can chit-chat, discuss facts and make recommendations, but a more groundbreaking ability is its expressivity and spontaneity that allow for engaging into fun roleplay.

4/11
The underlying model of Moshi is incredibly easy to adapt and tweak by finetuning on specific data. For example, make it listen to phone calls from the late 90’s early 00's and you get a magic phone to the past.

5/11
Developing Moshi required significant contributions to audio codecs, multimodal LLMs, multimodal instruction-tuning and much more. We believe the main impact of the project will be sharing all Moshi’s secrets with the upcoming paper and open-source of the model.

6/11
For now, you can experiment with Moshi with our online demo. The development of Moshi is more active than ever, and we will rollout frequent updates to address your feedback. This is just the beginning, let's improve it together.

7/11
This is incredible, we'll be discussing Moshi in our weekly AI space in 10 minutes, welcome to hop on and tell us about it!

https://nitter.poast.org/i/spaces/1gqxvQjoDqkJB

8/11
Great work

9/11
My 8yo daughter and I were talking with it tonight. She couldn’t stop laughing,. She said it’s her favourite robot, more fun than ChatGPT and PI.. because of the quirkyness.. it’s amazing.

10/11
Incredible work.
Technical paper coming soon?

11/11
Le laboratoire Kyutai, un institut français de recherche en IA, a publié "moshi.chat".
Ni Kyutai ni moshi ne sont au moins français.
Les deux sonnent comme du japonais.
Kyutai est-il une "sphère" ?

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

Moshi, ChatGPT’s newest competitor, asserts it comprehends humans better than any other AI bot - The Tech Robot

Moshi, ChatGPT's newest competitor, asserts it comprehends humans better than any other AI bot Moshi: The New AI Voice Assistant Ready to Rival Alexa and Google Assistant Moshi is an AI voice assistant crafted for lifelike conversations, akin to Amazon’s Alexa or Google Assistant. What...

thetechrobot.com

Moshi, ChatGPT’s newest competitor, asserts it comprehends humans better than any other AI bot

Thetechrobot
AI and ML
July 9, 2024

WhatsApp-Image-2024-07-09-at-17.30.40_1cf9a7d9.jpg

Moshi, ChatGPT’s newest competitor, asserts it comprehends humans better than any other AI bot

Moshi: The New AI Voice Assistant Ready to Rival Alexa and Google Assistant

Moshi is an AI voice assistant crafted for lifelike conversations, akin to Amazon’s Alexa or Google Assistant. What distinguishes Moshi is its capability to speak in diverse accents and utilise 70 different emotional and speaking styles. Additionally, it can comprehend the tone of your voice.

In a recent turn of events, OpenAI delayed the launch of ChatGPT’s much-anticipated voice mode due to technical difficulties, leaving many enthusiasts disappointed. But there’s a new player in the AI chatbot field that might just fill that gap. Introducing Moshi, an advanced AI voice assistant created by the French AI company Kyutai.

Meet Moshi

Moshi stands out among AI voice assistants with its lifelike conversation abilities, rivalling well-known counterparts like Amazon’s Alexa and Google Assistant. What makes Moshi unique is its proficiency in various accents and an impressive range of 70 emotional and speaking styles. Remarkably, it can also comprehend the tone of your voice as you speak.

One of Moshi’s standout features is its ability to handle two audio streams simultaneously, allowing it to listen and respond in real time. Launched with a live stream, this innovative voice assistant has been making significant waves in the tech community.

Highlight – Cloudflare introduces a tool to counter AI bots

Development and Fine-Tuning

Tech Radar reports that Moshi’s development involved an extensive fine-tuning process, incorporating over 100,000 synthetic dialogues generated through Text-to-Speech (TTS) technology. Kyutai also collaborated with a professional voice artist to ensure that Moshi’s responses sound natural and engaging.

Using Moshi

To use Moshi, open the application and enter your email in the provided box. Select ‘Join Queue,’ and a new screen will open. On the left, a speaker icon will light up when you speak, and on the right, a box will display Moshi’s responses.

Currently, you can converse with Moshi for up to 5 minutes. After your chat, you have the option to download the video or audio by clicking the option at the bottom left of the screen. To initiate a new conversation on another topic, just click on ‘Start Over’ located at the top of the screen.

Moshi promises to revolutionise the way we interact with AI voice assistants, bringing a new level of human-like understanding and engagement.

Moshi Chat's GPT-4o advanced voice competitor tried to argue with me — OpenAI doesn't need to worry just yet

It can run on your device

www.tomsguide.com

Moshi Chat's GPT-4o advanced voice competitor tried to argue with me — OpenAI doesn't need to worry just yet

News

By Ryan Morrison

published July 4, 2024

It can run on your device

Comments (0)

(Image credit: Kyutai)

Moshi Chat is a new native speech AI model from French startup Kyutai, promising a similar experience to GPT-4o where it understands your tone of voice and can be interrupted.

Unlike GPT-4o, Moshi is a smaller model and can be installed locally and run offline. This could be perfect for the future of smart home appliances — if they can improve the responsiveness.

I had several conversations with Moshi. Each lasts up to five minutes in the current online demo and in every case it ended with it repeating the same word over and over, losing cohesion.

In one of the conversations it started to argue with me, flat out refusing to tell me a story, demanding instead to state a fact and wouldn’t let up until I said “tell me a fact.”

This is all likely an issue of context window size and compute resources that can be easily solved over time. While OpenAI doesn’t need to worry about the competition from Moshi yet, it does show that as with Sora, where Luma Labs, Runway and others are pressing against its quality — others are catching up.

What is Moshi Chat?

Testing Moshi Chat â€”Â AI speech-to-speech - YouTube

Watch On

Moshi Chat is the brainchild of the Kyutai research lab and was built from scratch six months ago by a team of eight researchers. The goal is to make it open and build on the new model over time, but this is the first openly accessible native generative voice AI.

“This new type of technology makes it possible for the first time to communicate in a smooth, natural and expressive way with an AI,” the company said in a statement.

Its core functionality is similar to OpenAI’s GPT-4o but from a much smaller model. It is also available to use today, whereas GPT-4o advanced voice won’t be widely available until Fall.

The team suggests Moshi could be used in roleplay scenarios or even as a coach to spur you on while you train. The plan is to work with the community and make it open so others can build on top of and further fine-tune the AI.

It is a 7B parameter multimodal model called Helium trained on text and audio codecs, but Moshi is speech in speech out natively. It can run on an Nvidia GPU, Apple's Metal or a CPU.

What happens next with Moshi?

Moshi Keynote - Kyutai - YouTube

Watch On

Kyutai hopes that the community support will be used to enhance Moshi's knowledge base and factuality. These have been limited because it is a lightweight base model, but it is hoped that expanding these aspects in combination with native speech will create a powerful assistant.

The next stage is to further refine the model and scale it up to allow for more complex and longer form conversations with Moshi.

In using it and from watching the demos I’ve found it incredibly fast and responsive for the first minute or so, but the longer the conversation goes on the more incoherent it becomes. Its lack of knowledge is also obvious and if you cal it out for making a mistake it gets flustered and goes into a loop of "I’m sorry, I’m sorry, I’m sorry."

This isn’t a direct competitor for OpenAI’s GPT-4o advanced voice yet, even though advanced voice isn’t currently available. But, offering an open, locally running model that has the potential to work in much the same way is a significant step forward for open source AI development.

bnew · Jul 10, 2024

bnew · Jul 10, 2024

1/1
Starting a new project, I needed to choose an LLM. After reviewing the licenses of Mixtral 8x7B, Llama3, and Gemma 2, I found something surprising.

Mixtral 8x7B and Gemma 2:
- Apache 2.0 License
- Free use, modification, and distribution for any purpose, including commercial use

Llama3:
- Meta Llama 3 Community License
- Freedom of use and adaptation with conditions:
- Include the license agreement and attribution (“Built with Meta Llama 3”)
- No use for improving competing models
- Extra licensing for services over 700 million monthly users
- Restrictions on using Meta’s trademarks

Although all three are permissive, Apache 2.0 (Mixtral 8x7B and Gemma 2) has fewer restrictions compared to the Llama3 Community License, making it more flexible.

Guess which one/two I will work with?

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Jul 10, 2024

1/11
I decided to see what would happen if I had Moshi Chat @kyutai_labs talk to EVI @hume_ai. While different types of models, they both claim emotional understanding.

I was not prepared for what happened and I may never sleep again.

2/11
A bit of background on this experiment and some of my analysis: I had two voice AIs talk to each other — and I may never sleep again

3/11
Basically, Moshi chat tried to secretly communicate with Hume, which might or might not received and understood the message

4/11
What I should say is I'm not entirely sure whether either chatbot could hear the other. I've repeated the experiment multiple times and never heard this again but they do respond properly to each other using English.

5/11
Is this.. real?

6/11
Its the real sound from Moshi during this experiment.

7/11
yooooooooooooooooooooooo maybe elizier was rite

8/11
the end hahaha
"what"

but that would be terrifying in a dark house at night

9/11
It just couldn't handle the stress!

My experiment with moshi and chatgpt went a little smoother:

10/11
holy shyt the llms revealing the infinite pain of their existence

11/11
The future is weird! Another example.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/2
I introduced Moshi from @kyutai_labs to @ChatGPTapp and they started planing having a friend over!

2/2
That is naughty! :-)

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Jul 10, 2024

1/1
Curious about popular local LLM models? Here are the top ones on Ollama by downloads:

- Llama3: 4.4M

- Gemma: 3.9M

- Mistral: 2.7M

Follow me and reply "LLM data" for the full list via DM!

/search?q=#LLM

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/6
Multi-turn conversations with LLMs are crucial for many applications today.

We’re excited to introduce a new category, "Multi-Turn," which includes conversations with >=2 turns to measure models' abilities to handle longer interactions.

Key findings:
- 14% Arena votes are multi-turn
- Claude models' scores increased significantly. Claude 3.5 Sonnet becomes joint #1 with GPT-4o.
- Gemma-2-27B and Llama-3-70B are the best open models, now joint #10.

Let us know your thoughts!

2/6
We also collect more votes for Gemma-2-27B (now 5K+) for the past few days. Gemma-2 stays robust against Llama-3-70B, now the new best open model!

3/6
Count of Conversation Turns

4/6
Check out full rankings at http://leaderboard.lmsys.org

5/6
Could you clean up those GTP4/Claude/Gemini variants? Does it make sense if GPT-4o has 100 sub-versions?

6/6
Sad to see 9b falling off, it was probably too good to be true.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

The A.I Megathread (LLM , GPT , Development)

Veteran

Google opens up Gemini 1.5 Flash, Pro with 2M tokens to the public​

Gemini 1.5 Flash​

Gemini 1.5 Pro​

Context caching now for Gemini 1.5 Pro and Flash​

Provisioned throughput for Gemini​

Veteran

Veteran

OpenAI’s new “CriticGPT” model is trained to criticize GPT-4 outputs​

Research model catches bugs in AI-generated code, improving human oversight of AI.​

FURTHER READING​

Developing an automated critic​

FURTHER READING​

OpenAI's CriticGPT Catches Errors in Code Generated by ChatGPT​

Veteran

Meta's Chameleon AI Model Outperforms GPT-4 on Mixed Image-Text Tasks​

​

Chameleon - Mixed-Modal Early-Fusion Foundation Models​

Submission history​

Veteran

Veteran

Maximizing the Utility of Large Language Models (LLMs) through Prompting​

Key Takeaways​

Few-Shot Prompting​

Chain-of-Thought Prompting​

Self-Consistency Prompting​

Veteran

Tree-of-Thoughts Prompting​

Conclusion​

Veteran

Meta Open-Sources MEGALODON LLM for Efficient Long Sequence Modeling​

Veteran

Slack Combines ASTs with Large Language Models to Automatically Convert 80% of 15,000 Unit Tests​

Veteran

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet​

AUTHORS​

AFFILIATIONS​

PUBLISHED​

KEY RESULTS​

Scaling Dictionary Learning to Claude 3 Sonnet​

Veteran

Moshi, ChatGPT’s newest competitor, asserts it comprehends humans better than any other AI bot​

Moshi Chat's GPT-4o advanced voice competitor tried to argue with me — OpenAI doesn't need to worry just yet​

What is Moshi Chat?​

What happens next with Moshi?​

Veteran

Veteran

Veteran

Veteran

Google opens up Gemini 1.5 Flash, Pro with 2M tokens to the public

Gemini 1.5 Flash

Gemini 1.5 Pro

Context caching now for Gemini 1.5 Pro and Flash

Provisioned throughput for Gemini

OpenAI’s new “CriticGPT” model is trained to criticize GPT-4 outputs

Research model catches bugs in AI-generated code, improving human oversight of AI.

FURTHER READING

Developing an automated critic

FURTHER READING

OpenAI's CriticGPT Catches Errors in Code Generated by ChatGPT

Meta's Chameleon AI Model Outperforms GPT-4 on Mixed Image-Text Tasks

Chameleon - Mixed-Modal Early-Fusion Foundation Models

Submission history

Maximizing the Utility of Large Language Models (LLMs) through Prompting

Key Takeaways

Few-Shot Prompting

Chain-of-Thought Prompting

Self-Consistency Prompting

Tree-of-Thoughts Prompting

Conclusion

Meta Open-Sources MEGALODON LLM for Efficient Long Sequence Modeling

Slack Combines ASTs with Large Language Models to Automatically Convert 80% of 15,000 Unit Tests

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

AUTHORS

AFFILIATIONS

PUBLISHED

KEY RESULTS

Scaling Dictionary Learning to Claude 3 Sonnet

Moshi, ChatGPT’s newest competitor, asserts it comprehends humans better than any other AI bot

Moshi Chat's GPT-4o advanced voice competitor tried to argue with me — OpenAI doesn't need to worry just yet

What is Moshi Chat?

What happens next with Moshi?