Large Language Models News & Discussions

bnew · Nov 3, 2023

[Submitted on 28 May 2022 (v1), last revised 13 Jun 2022 (this version, v2)]

Teaching Models to Express Their Uncertainty in Words

Stephanie Lin, Jacob Hilton, Owain Evans

We show that a GPT-3 model can learn to express uncertainty about its own answers in natural language -- without use of model logits. When given a question, the model generates both an answer and a level of confidence (e.g. "90% confidence" or "high confidence"). These levels map to probabilities that are well calibrated. The model also remains moderately calibrated under distribution shift, and is sensitive to uncertainty in its own answers, rather than imitating human examples. To our knowledge, this is the first time a model has been shown to express calibrated uncertainty about its own answers in natural language. For testing calibration, we introduce the CalibratedMath suite of tasks. We compare the calibration of uncertainty expressed in words ("verbalized probability") to uncertainty extracted from model logits. Both kinds of uncertainty are capable of generalizing calibration under distribution shift. We also provide evidence that GPT-3's ability to generalize calibration depends on pre-trained latent representations that correlate with epistemic uncertainty over its answers.

Comments:	CalibratedMath tasks and evaluation code are available at this https URL
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2205.14334 [cs.CL]
	(or arXiv:2205.14334v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2205.14334

Submission history

From: Stephanie Lin
[v1] Sat, 28 May 2022 05:02:31 UTC (2,691 KB)
[v2] Mon, 13 Jun 2022 05:04:53 UTC (2,691 KB)

https://arxiv.org/pdf/2205.14334.pdf

bnew · Nov 3, 2023

How can AI better understand humans? Simple: by asking us questions

This could save enterprise software developers a lot of time when booting up LLM-powered chatbots for customer or employee-facing apps

venturebeat.com

Carl Franzen@carlfranzen

October 31, 2023 1:35 PM

A masculine presenting person in a gray suit and red necktie with square glasses and clean shaven face and short hair stands beside a boxy white and orange robot with blue visor. Both hold clipboards.

Credit: VentureBeat made with Midjourney

Anyone who has dealt in a customer-facing job — or even just worked with a team of more than a few individuals — knows that every person on Earth has their own unique, sometimes baffling, preferences.

Understanding the preferences of every individual is difficult even for us fellow humans. But what about for AI models, which have no direct human experience upon which to draw, let alone use as a frame-of-reference to apply to others when trying to understand what they want?

A team of researchers from leading institutions and the startup Anthropic, the company behind the large language model (LLM)/chatbot Claude 2, is working on this very problem and has come up with a seemingly obvious solution: get AI models to ask more questions of users to find out what they really want.

Entering a new world of AI understanding through GATE

Anthropic researcher Alex Tamkin, together with colleagues Belinda Z. Li and Jacob Andreas of the Massachusetts Institute of Technology’s (MIT’s) Computer Science and Artificial Intelligence Laboratory (CSAIL), along with Noah Goodman of Stanford, published a research paper earlier this month on their method, which they call “generative active task elicitation (GATE).”

Their goal? “Use [large language] models themselves to help convert human preferences into automated decision-making systems.”

In other words: take an LLM’s existing capability to analyze and generate text and use it to ask written questions of the user on their first interaction with the LLM. The LLM will then read the user’s answers, incorporate them into its generations on the fly, and (this is important) infer from those answers, based on the related words and concepts it learned during training, what the user is ultimately asking for.

As the researchers write: “The effectiveness of language models (LMs) for understanding and producing free-form text suggests that they may be capable of eliciting and understanding user preferences.”

The three GATES

The method can be applied in various ways, according to the researchers:

Generative active learning: The researchers describe this method as the LLM producing examples of the kind of responses it can deliver and asking how the user likes them. One example question they provide for an LLM to ask is: “Are you interested in the following article? The Art of Fusion Cuisine: Mixing Cultures and Flavors […] .” Based on what the user responds, the LLM will deliver more or less content along those lines.
Yes/no question generation: This method is as simple as it sounds (and gets). The LLM will ask binary yes or no questions such as: “Do you enjoy reading articles about health and wellness?” and then take into account the user’s answers when responding going forward, avoiding information that it associates with those questions that received a “no” answer.
Open-ended questions: Similar to the first method, but even broader. As the researchers write, the LLM will seek to obtain “the broadest and most abstract pieces of knowledge” from the user, including questions such as “What hobbies or activities do you enjoy in your free time […], and why do these hobbies or activities captivate you?”
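
The three elicitation modes above can be sketched as prompt templates plus a short question-and-answer loop. This is a minimal illustration, not the researchers' implementation: `ask_model` and `ask_user` are hypothetical callables standing in for an LLM API and a chat interface, and the template wording is invented for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical instructions, one per elicitation mode described above.
TEMPLATES = {
    "generative_active_learning": (
        "Propose one concrete example for the task '{task}' and ask the user "
        "whether they would want it."
    ),
    "yes_no": "Ask the user one yes/no question about their preferences for '{task}'.",
    "open_ended": "Ask the user one broad, open-ended question about '{task}'.",
}


def build_prompt(mode: str, task: str, history: List[Tuple[str, str]]) -> str:
    """Combine the mode's instruction with the questions and answers so far."""
    lines = [TEMPLATES[mode].format(task=task)]
    for question, answer in history:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
    return "\n".join(lines)


def elicit(
    mode: str,
    task: str,
    ask_model: Callable[[str], str],
    ask_user: Callable[[str], str],
    turns: int = 3,
) -> List[Tuple[str, str]]:
    """Run a fixed number of question/answer turns and return the transcript."""
    history: List[Tuple[str, str]] = []
    for _ in range(turns):
        question = ask_model(build_prompt(mode, task, history))
        answer = ask_user(question)
        history.append((question, answer))
    return history
```

The returned transcript is what a downstream prompt would condition on when predicting the user's preferences; how that final prediction step is done is left out of this sketch.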

Promising results

The researchers tried out the GATE method in three domains — content recommendation, moral reasoning and email validation.

By prompting GPT-4, from Anthropic rival OpenAI, and recruiting 388 paid participants at $12 per hour to answer questions from GPT-4 and grade its responses, the researchers discovered GATE often yields more accurate models than baselines while requiring comparable or less mental effort from users.

Specifically, they discovered that GPT-4 prompted with GATE did a better job at guessing each user’s individual preferences in its responses by about 0.05 points of significance when subjectively measured, which sounds like a small amount but is actually a lot when starting from zero, as the researchers’ scale does.

Fig. 3 chart from the paper “Eliciting Human Preferences With Language Models” published on arXiv.org dated Oct. 17, 2023.

Ultimately, the researchers state that they “presented initial evidence that LMs can successfully implement GATE to elicit human preferences (sometimes) more accurately and with less effort than supervised learning, active learning, or prompting-based approaches.”

This could save enterprise software developers a lot of time when booting up LLM-powered chatbots for customer or employee-facing applications. Instead of training them on a corpus of data and trying to use that to ascertain individual customer preferences, fine-tuning their preferred models to perform the Q/A dance specified above could make it easier for them to craft engaging, positive, and helpful experiences for their intended users.

So, if your AI chatbot of choice begins asking you questions about your preferences in the near future, there’s a good chance it may be using the GATE method to try to give you better responses going forward.

bnew · Nov 3, 2023

LLMs have not learned our language — we’re trying to learn theirs

The limits of large language models (LLMs) have given rise to a trend of research in which we are learning the language of LLMs and discovering ways to better communicate with them.

venturebeat.com

Ben Dickson@BenDee983

August 30, 2022 11:40 AM

Conversational AI Concept - Natural Language Processing - NLP - Computational Linguistics Concept

Large language models (LLMs) are currently a red-hot area of research in the artificial intelligence (AI) community. Scientific progress in LLMs in the past couple of years has been nothing short of impressive, and at the same time, there is growing interest and momentum to create platforms and products powered by LLMs.

However, in tandem with advances in the field, the shortcomings of large language models have also become evident. Many experts agree that no matter how large LLMs and their training datasets become, they will never be able to learn and understand our language as we do.

Interestingly, these limits have given rise to a trend of research focused on studying the knowledge and behavior of LLMs. In other words, we are learning the language of LLMs and discovering ways to better communicate with them.

What LLMs can’t learn

LLMs are neural networks that have been trained on hundreds of gigabytes of text gathered from the web. During training, the network is fed with text excerpts that have been partially masked. The neural network tries to guess the missing parts and compares its predictions with the actual text. By doing this repeatedly and gradually adjusting its parameters, the neural network creates a mathematical model of how words appear next to each other and in sequences.

After being trained, the LLM can receive a prompt and predict the words that come after it. The larger the neural network, the more learning capacity the LLM has. The larger the dataset (given that it contains well-curated and high-quality text), the greater chance that the model will be exposed to different word sequences and the more accurate it becomes in generating text.

However, human language is about much more than just text. In fact, language is a compressed way to transmit information from one brain to another. Our conversations often omit shared knowledge, such as visual and audible information, physical experience of the world, past conversations, our understanding of the behavior of people and objects, social constructs and norms, and much more.

As Yann LeCun, VP and chief AI scientist at Meta and award-winning deep learning pioneer, and Jacob Browning, a post-doctoral associate in the NYU Computer Science Department, wrote in a recent article, “A system trained on language alone will never approximate human intelligence, even if trained from now until the heat death of the universe.”

The two scientists note, however, that LLMs “will undoubtedly seem to approximate [human intelligence] if we stick to the surface. And, in many cases, the surface is enough.”

The key is to understand how close this approximation is to reality, and how to make sure LLMs are responding in the way we expect them to. Here are some directions of research that are shaping this corner of the widening LLM landscape.

Teaching LLMs to express uncertainty

In most cases, humans know the limits of their knowledge (even if they don’t directly admit it). They can express uncertainty and doubt and let their interlocutors know how confident they are in the knowledge they are passing. On the other hand, LLMs always have a ready answer for any prompt, even if their output doesn’t make sense. Neural networks usually provide numerical values that represent the probability that a certain prediction is correct. But for language models, these probability scores do not represent the LLM’s confidence in the reliability of its response to a prompt.

A recent paper by researchers at OpenAI and the University of Oxford shows how this shortcoming can be remedied by teaching LLMs “to express their uncertainty in words.”

They show that LLMs can be fine-tuned to express epistemic uncertainty using natural language, which they describe as “verbalized probability.” This is an important direction of development, especially in applications where users want to turn LLM output into actions.

The researchers suggest that expressing uncertainty can make language models honest. “If an honest model has a misinformed or malign internal state, then it could communicate this state to humans who can act accordingly,” they write.
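
To make "calibrated" concrete: a model's stated confidences are well calibrated if, among the answers it tags with roughly 90% confidence, roughly 90% turn out to be correct. The sketch below computes a simple binned calibration error from (confidence, correct) pairs. It is a generic illustration; the function name, the bin count, and the sample data are assumptions, not the paper's evaluation code.

```python
from typing import List, Tuple


def calibration_error(records: List[Tuple[float, bool]], n_bins: int = 10) -> float:
    """Weighted average gap between stated confidence and accuracy per bin.

    Each record is (confidence in [0, 1], whether the answer was correct).
    """
    if not records:
        raise ValueError("records must not be empty")
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for confidence, correct in records:
        # A confidence of exactly 1.0 goes into the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, correct))
    total = len(records)
    error = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_confidence = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        error += (len(bucket) / total) * abs(mean_confidence - accuracy)
    return error


# Made-up sample: the model says "75%" four times and is right three times.
sample = [(0.75, True), (0.75, True), (0.75, True), (0.75, False)]
print(calibration_error(sample))
```

A score near zero means the verbal confidence levels can be taken at face value; a large score means the model is systematically over- or under-confident.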

Discovering emergent abilities of LLMs

Scale has been an important factor in the success of language models. As models become larger, not only does their performance improve on existing tasks, but they acquire the capacity to learn and perform new tasks.

In a new paper, researchers at Google, Stanford University, DeepMind, and the University of North Carolina at Chapel Hill have explored the “emergent abilities” of LLMs, which they define as abilities that “are not present in smaller models but are present in larger models.”

Emergence is characterized by the model manifesting random performance on a certain task until it reaches a certain scale threshold, after which its performance suddenly jumps and continues to improve as the model becomes larger.

The paper covers emergent abilities in several popular LLM families, including GPT-3, LaMDA, Gopher, and PaLM. The study of emergent abilities is important because it provides insights into the limits of language models at different scales. It can also help find ways to improve the capabilities of the smaller and less costly models.

Exploring the limits of LLMs in reasoning

Given the ability of LLMs to generate articles, write software code, and hold conversations about sentience and life, it is easy to think that they can reason and plan things like humans.

But a study by researchers at Arizona State University, Tempe, shows that LLMs do not acquire the knowledge and functions underlying tasks that require methodical thinking and planning, even when they perform well on benchmarks designed for logical, ethical and common-sense reasoning.

The study shows that what looks like planning and reasoning in LLMs is, in reality, pattern recognition abilities gained from continued exposure to the same sequence of events and decisions. This is akin to how humans acquire some skills (such as driving), where they first require careful thinking and coordination of actions and decisions but gradually become able to perform them without active thinking.

The researchers have established a new benchmark that tests reasoning abilities on tasks that stretch across long sequences and can’t be cheated through pattern-recognition tricks. The goal of the benchmark is to establish the current baseline and open new windows for developing planning and reasoning capabilities for current AI systems.

Guiding LLMs with better prompts

As the limits of LLMs become known, researchers find ways to either extend or circumvent them. In this regard, an interesting area of research is “prompt engineering,” a series of tricks that can improve the performance of language models on specific tasks. Prompt engineering guides LLMs by including solved examples or other cues in prompts.

One such technique is “chain-of-thought prompting” (CoT), which helps the model solve logical problems by providing a prompt that includes a solved example with intermediary reasoning steps. CoT prompting not only improves LLMs’ abilities to solve reasoning tasks, but it also gets them to output the steps they undergo to solve each problem. This helps researchers gain insights into LLMs’ reasoning process (or semblance of reasoning).

A more recent technique that builds on the success of CoT is “zero-shot chain-of-thought prompting,” which uses special trigger phrases such as “Let’s think step by step” to invoke reasoning in LLMs. The advantage of zero-shot CoT is that it does not require the user to craft a special prompt for each task, and although it is simple, it still works well enough in many cases.
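
The difference between the two prompting styles is easiest to see in the prompt text itself. The helpers below are a minimal sketch: the function names and the worked example are invented for illustration, and only the trigger phrase comes from the article.

```python
def few_shot_cot_prompt(example_q: str, example_steps: str, example_a: str, question: str) -> str:
    """Chain-of-thought prompt: one solved example with its reasoning, then the new question."""
    return (
        f"Q: {example_q}\n"
        f"A: {example_steps} The answer is {example_a}.\n\n"
        f"Q: {question}\n"
        "A:"
    )


def zero_shot_cot_prompt(question: str) -> str:
    """Zero-shot CoT prompt: no solved example, just the trigger phrase."""
    return f"Q: {question}\nA: Let's think step by step."


# Invented worked example, used only to show the prompt layout.
print(few_shot_cot_prompt(
    "A shelf holds 3 boxes with 4 books each. How many books are there?",
    "Each box has 4 books and there are 3 boxes, so 3 * 4 = 12.",
    "12",
    "A crate holds 5 bags with 6 apples each. How many apples are there?",
))
print(zero_shot_cot_prompt("A crate holds 5 bags with 6 apples each. How many apples are there?"))
```

The few-shot version needs a task-specific solved example, while the zero-shot version reuses the same suffix for any question.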

These and similar works of research show that we still have a lot to learn about LLMs, and there might be more to be discovered about the language models that have captured our fascination in the past few years.

bnew · Nov 3, 2023

[Submitted on 29 May 2023 (v1), last revised 30 May 2023 (this version, v2)]

Do Large Language Models Know What They Don't Know?

Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu, Xipeng Qiu, Xuanjing Huang

Large language models (LLMs) have a wealth of knowledge that allows them to excel in various Natural Language Processing (NLP) tasks. Current research focuses on enhancing their performance within their existing knowledge. Despite their vast knowledge, LLMs are still limited by the amount of information they can accommodate and comprehend. Therefore, the ability to understand their own limitations on the unknowns, referred to as self-knowledge, is of paramount importance. This study aims to evaluate LLMs' self-knowledge by assessing their ability to identify unanswerable or unknowable questions. We introduce an automated methodology to detect uncertainty in the responses of these models, providing a novel measure of their self-knowledge. We further introduce a unique dataset, SelfAware, consisting of unanswerable questions from five diverse categories and their answerable counterparts. Our extensive analysis, involving 20 LLMs including GPT-3, InstructGPT, and LLaMA, discovers an intrinsic capacity for self-knowledge within these models. Moreover, we demonstrate that in-context learning and instruction tuning can further enhance this self-knowledge. Despite this promising insight, our findings also highlight a considerable gap between the capabilities of these models and human proficiency in recognizing the limits of their knowledge.

Comments:	10 pages, 9 figures, accepted by Findings of ACL2023
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2305.18153 [cs.CL]
	(or arXiv:2305.18153v2 [cs.CL] for this version)
	[2305.18153] Do Large Language Models Know What They Don't Know?

Submission history

From: Zhangyue Yin
[v1] Mon, 29 May 2023 15:30:13 UTC (7,214 KB)
[v2] Tue, 30 May 2023 15:14:06 UTC (7,214 KB)

https://arxiv.org/pdf/2305.18153.pdf

bnew · Nov 3, 2023

https://venturebeat.com/ai/midjourneys-new-style-tuner-is-here-heres-how-to-use-it/

Midjourney’s new style tuner is here. Here’s how to use it.

Carl Franzen@carlfranzen

November 2, 2023 10:36 AM

A young masculine presenting person wearing sunglasses with neon orange lenses and a blue jean jacket over a white shirt types at a laptop in front of a colorful polygonal neon backdrop.

Credit: VentureBeat made with Midjourney

Midjourney is one of the most popular AI art and text-to-image generators. It produces high-quality photorealistic and cinematic works from users’ prompts typed in plain English, and its output has already wound up on TV and in cinemas (as well as on VentureBeat, where we use it along with other tools for article art).

Conceived by Leap Motion co-founder David Holz and launched in the summer of 2022, it has since attracted a community of more than 16 million users in its server on the separate messaging app Discord, and has been steadily updated by a small team of programmers with new features including panning, vary region and an anime-focused mobile app.

But its latest update launched on the evening of Nov. 1, 2023 — called the style tuner — is arguably the most important yet for enterprises, brands and creators looking to tell cohesive stories in the same style. That’s because Midjourney’s new style tuner allows users to generate their unique visual style and apply it to any and potentially all images generated in the application going forward.

Before style tuning, users had to repeat their text descriptions to generate consistent styles across multiple images — and even this was no guarantee, since Midjourney, like most AI art generators, is built to offer a functionally infinite variety of image styles and types.

Now, instead of relying on their language, users can select between a variety of styles and obtain a code to apply to all their works going forward, keeping them in the same aesthetic family. Midjourney users can also elect to copy and paste their code elsewhere to save it and reference it going forward, or even share it with other Midjourney users in their organization to allow them to generate images in that same style. This is huge for enterprises, brands, and anyone seeking to work on group creative projects in a unified style. Here’s how it works:

Where to find Midjourney’s style tuner

Going into the Midjourney Discord server, the user can simply type “/tune” followed by their prompt to begin the process of tuning their styles.

For example, let’s say I want to update the background imagery of my product or service website for the winter to include more snowy scenes and cozy spaces.

I can type in a single prompt idea I have — “a robot wears a cozy sweater and sits in front of a fireplace drinking hot chocolate out of a mug” — after the “/tune,” like this: “/tune a robot wears a cozy sweater and sits in front of a fireplace drinking hot chocolate out of a mug.”

Midjourney’s Discord bot responds with a large automatic message explaining the style-tuning process at a high level and asking if the user wants to continue. The process requires a paid Midjourney subscription plan (they start at $10 per month paid monthly or $96 per year up-front) and uses up some of the fast hours GPU credits that come with each plan (and vary depending on the plan tier level, with more expensive plans granting more fast hours GPU credits). These credits are used for generating images more rapidly than the “relaxed” mode.

Screenshot-2023-11-02-at-12.05.35-PM.png

Selecting style directions and mode and what they mean

This message includes two drop-down menus allowing the user to select different options: the number of “style directions” (16, 32, 64, or 128) and the “mode” (default or raw).

The “style directions” setting indicates how many different images Midjourney will generate from the user’s prompts, each one showing a distinctly different style. The user will then have the chance to choose their style from between these images, or combine the resulting images to create a new meta-style based on several of them.

Importantly, the different numbers of images produced by the different style direction options each cost a different amount of fast hours GPU credits. For instance, 16 style directions use up 0.15 fast hours of GPU credits, while 128 style directions use up 1.2 credits. So the user should think hard and discerningly about how many different styles they want to generate and whether they want to spend all those credits.
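
The two figures quoted above (0.15 fast hours for 16 directions, 1.2 for 128) are consistent with a cost that scales linearly with the number of style directions. Under that assumption, which the article does not confirm for the 32 and 64 options, a rough estimate can be computed like this (the function name is made up for the example):

```python
FAST_HOURS_PER_16_DIRECTIONS = 0.15  # figure quoted in the article
ALLOWED_DIRECTIONS = (16, 32, 64, 128)  # options in the Discord drop-down


def estimated_fast_hours(style_directions: int) -> float:
    """Estimate the fast-hours cost, ASSUMING linear scaling between the quoted figures."""
    if style_directions not in ALLOWED_DIRECTIONS:
        raise ValueError(f"style_directions must be one of {ALLOWED_DIRECTIONS}")
    return round(FAST_HOURS_PER_16_DIRECTIONS * style_directions / 16, 2)


for directions in ALLOWED_DIRECTIONS:
    print(directions, estimated_fast_hours(directions))
```

Treat the 32 and 64 values as interpolated guesses; the confirmation message in Discord shows the actual charge before you commit.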

Meanwhile, the “mode” setting is binary, allowing the user to choose between default or raw, referencing how candid and grainy the photos will appear. Raw images are meant to look more like a film or DSLR camera and as such, may be more photorealistic, but also contain artifacts that the default, sanitized and smooth mode does not.

In our walkthrough for this article, VentureBeat selected 16 style directions and default mode. In our tests, and those reported by several users online, Midjourney was erroneously giving users one tier more style directions than they asked for, so in our case we got 32 even though we asked for 16.

After selecting your mode and style directions, the Midjourney bot will ask you if you are sure you want to continue and show you again how many credits you’re using up, and if you press the green button, you can continue. The process can take up to 2 minutes.

Where to find the different styles to choose from

After Midjourney finishes processing your style tuner options, the bot should respond with a message saying “Style Tuner Ready! Your custom style tuner has finished generating. You can now view, share and generate styles here:” followed by a URL to the Midjourney Tuner website (the domain is tuner.midjourney.com).

The resulting URL should contain a random string of letters and numbers at the end. We’ve removed ours for security purposes in the screenshot below.

Screenshot-2023-11-02-at-12.22.10-PM-1.png

Clicking the URL takes the user out of the Discord app and onto the Midjourney website in their browser.

There, the user will see a customized yet default message from Midjourney showing the user’s prompt language and explaining how to finish the tuning process. Namely, Midjourney asks the user to select between two different options with labeled buttons: “Compare two styles at a time” or “Pick your favorite from a big grid.”

Screenshot-2023-11-02-at-12.26.26-PM.png

In the first option, “Compare two styles at a time,” Midjourney displays the style directions you selected previously in Discord in rows of two. In our case, that’s 16 rows. Each row contains two 2×2 image grids, so 8 images per row.

Screenshot-2023-11-02-at-12.33.22-PM.png

The user can then choose one 2×2 grid from each row, of however many rows they would like, and Midjourney will make a style informed by the combination of those grids. You can tell which grid is selected by the white outline that appears around it.

So, if I chose the image on the right from the first row, and the image on the left from the bottom row, Midjourney would apply both of those image styles into a combined style and the user could apply that combined style to all images going forward. As Midjourney notes on the bottom of this selection page, selecting more choices from each row results in a more “nuanced and aligned” style while selecting only a few options will result in a “bold style.”

The second option, “Pick your favorite from a big grid,” lets the user choose just one image from the entire grid of all images generated according to the number of style directions the user set previously. In our case for this article, that’s a total of 32 images arranged in an 8×4 grid. This option is more precise and less ambiguous than the “compare two styles” option, but also more limiting as a result.

Screenshot-2023-11-02-at-12.33.10-PM.png

In our case, for this article, we will choose the “compare two styles at a time” option, select 5 grids total and leave it to the algorithms to decide what the combined style looks like.

Screenshot-2023-11-02-at-12.38.14-PM.png

Applying your freshly tuned style going forward to new images and prompts

Whatever number of rows or images a user selects to base their style on, Midjourney will automatically apply that style and turn it into a shortcode of numerals and letters that the user can manually copy and paste for all prompts going forward. That shortcode appears in several places at the bottom of the user’s unique Style Tuner page, both in a section marked “Your code is:” followed by the code, and then also in a sample prompt based on the original the user provided at the very bottom in a persistent overlay chyron element.

Screenshot-2023-11-02-at-12.41.24-PM.png

The user can then either copy this code and save it somewhere, or copy their entire original prompt with the code added from the bottom chyron. You can also redo this whole style by pressing the small “refresh” icon at the bottom (circular arrows).

Then, the user will need to return to the Midjourney Discord server and paste the code in after their prompt as follows: “/imagine a robot wears a cozy sweater and sits in front of a fireplace drinking hot chocolate out of a mug --style [INSERT STYLE CODE HERE]”

Here’s our resulting 2×2 grid of images using the original prompt and our freshly generated style:

Screenshot-2023-11-02-at-12.46.04-PM.png

We like the fourth one best, so we will select that one to upscale by clicking “U4” and voila, there is our resulting cozy robot drinking hot chocolate by the fireplace!

cfr0z3n_a_robot_wears_a_cozy_sweater_and_sits_in_front_of_a_fir_b736494f-41c5-4e9f-86d8-c4a6571107f0.png

Now let’s apply the same style to a new prompt by copying and pasting/manually adding the “--style” language to the end of our new prompt, like so: “a robot family opens presents --style [INSERT STYLE CODE HERE]” Here’s the result (after choosing one from our 2×2 grid):

cfr0z3n_a_robot_family_opens_presents_791d0e25-6080-4483-afec-f8dede339609.png

Not bad! Note this is after a few regenerations going back and forth. The style code also works alongside other parameters in your prompt, including aspect ratio/dimensions. Here’s a 16:9 version using the same prompt but written like so: “a robot family opens presents --ar 16:9 --style [INSERT STYLE CODE HERE]”

cfr0z3n_a_robot_family_opens_presents_ecfeb931-8fcf-4771-b943-bf098a353d05.png

Cute but a little wonky. We might suggest continuing to refine this one.
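
Because the style code is just another double-hyphen parameter appended to the prompt, assembling consistent prompts for a batch of images is simple string handling. The helper below is a hypothetical convenience function, not part of any Midjourney tooling: it only builds the text to paste after `/imagine` in Discord, and `abc123` is a placeholder rather than a real style code.

```python
from typing import Optional


def build_midjourney_prompt(
    description: str,
    style_code: str,
    aspect_ratio: Optional[str] = None,
) -> str:
    """Append the optional --ar parameter and the --style code to a description."""
    parts = [description.strip()]
    if aspect_ratio:
        parts.append(f"--ar {aspect_ratio}")
    parts.append(f"--style {style_code}")
    return " ".join(parts)


# Placeholder style code; substitute the one from your Style Tuner page.
for scene in ("a robot family opens presents", "a robot builds a snowman"):
    print(build_midjourney_prompt(scene, "abc123", aspect_ratio="16:9"))
```

Keeping the code in one place like this makes it easy to share a single style across a team without retyping it for each prompt.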

bnew · Nov 3, 2023

Microsoft unveils ‘LeMa’: A revolutionary AI learning method mirroring human problem solving

Microsoft's groundbreaking AI learning method, LeMa, trains machines to learn from their mistakes, enhancing problem-solving abilities and potentially revolutionizing sectors like healthcare, finance, and autonomous vehicles.

venturebeat.com

Michael Nuñez@MichaelFNunez

November 2, 2023 2:21 PM

Image of a Microsoft robot fixing another robot.

Credit: VentureBeat made with Midjourney

Researchers from Microsoft Research Asia, Peking University, and Xi’an Jiaotong University have developed a new technique to improve large language models’ (LLMs) ability to solve math problems by having them learn from their mistakes, akin to how humans learn.

The researchers have revealed a pioneering strategy, Learning from Mistakes (LeMa), which trains AI to correct its own mistakes, leading to enhanced reasoning abilities, according to a research paper published this week.

The researchers drew inspiration from human learning processes, where a student learns from their mistakes to improve future performance. “Consider a human student who failed to solve a math problem, he will learn from what mistake he has made and how to correct it,” the authors explained. They then applied this concept to LLMs, using mistake-correction data pairs generated by GPT-4 to fine-tune them.

How LeMa works to enhance math reasoning

The researchers first had models like LLaMA-2 generate flawed reasoning paths for math word problems. GPT-4 then identified errors in the reasoning, explained them and provided corrected reasoning paths. The researchers used the corrected data to further train the original models.
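
The data flow described above can be sketched as follows. Everything here is illustrative: the `CorrectionExample` fields, the dictionary keys, the prompt wording, and the `corrector` callable (standing in for GPT-4) are assumptions, not the researchers' released code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CorrectionExample:
    prompt: str      # question plus the flawed reasoning path
    completion: str  # corrector output: mistake, explanation, corrected reasoning


def build_correction_data(
    attempts: List[Dict[str, str]],
    corrector: Callable[[str], str],
) -> List[CorrectionExample]:
    """Turn wrong model attempts into mistake-correction fine-tuning pairs."""
    examples: List[CorrectionExample] = []
    for attempt in attempts:
        # Only attempts whose final answer disagrees with the reference are kept.
        if attempt["predicted"] == attempt["gold"]:
            continue
        prompt = (
            f"Question: {attempt['question']}\n"
            f"Flawed solution: {attempt['reasoning']}\n"
            "Identify the mistake, explain it, and give a corrected solution."
        )
        examples.append(CorrectionExample(prompt, corrector(prompt)))
    return examples
```

The resulting pairs would then be mixed into the fine-tuning set for the original model; that training step is outside the scope of this sketch.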

The results of this new approach are significant. “Across five backbone LLMs and two mathematical reasoning tasks, LeMa consistently improves the performance compared with fine-tuning on CoT data alone,” the researchers explain.

LeMa yields impressive results on challenging datasets

What’s more, specialized LLMs like WizardMath and MetaMath also benefited from LeMa, achieving 85.4% pass@1 accuracy on GSM8K and 27.1% on MATH. These results surpass the state-of-the-art performance achieved by non-execution open-source models on these challenging tasks.

This breakthrough signifies more than just an enhancement in the reasoning capability of AI models. It also marks a significant step towards AI systems that can learn and improve from their mistakes, much like humans do.

Broad Implications and Future Directions

The team’s research, including their code, data, and models, is now publicly available on GitHub. This open-source approach encourages the broader AI community to continue this line of exploration, potentially leading to further advancements in machine learning.

The advent of LeMa represents a major milestone in AI, suggesting that machine learning (ML) processes can be made more akin to human learning. This development could revolutionize sectors heavily reliant on AI, such as healthcare, finance, and autonomous vehicles, where error correction and continuous learning are critical.

As the AI field continues to evolve rapidly, the integration of human-like learning processes, such as learning from mistakes, appears to be an essential factor in developing more efficient and effective AI systems.

This breakthrough in machine learning underscores the exciting potential that lies ahead in the realm of artificial intelligence. As machines become more adept at learning from their mistakes, we move closer to a future where AI can exceed human capabilities in complex problem-solving tasks.

bnew · Nov 3, 2023

[2310.20689] Learning From Mistakes Makes LLM Better Reasoner

Computer Science > Computation and Language

[Submitted on 31 Oct 2023]

Learning From Mistakes Makes LLM Better Reasoner

Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian-Guang Lou, Weizhu Chen

Large language models (LLMs) recently exhibited remarkable reasoning capabilities on solving math problems. To further improve this capability, this work proposes Learning from Mistakes (LeMa), akin to human learning processes. Consider a human student who failed to solve a math problem, he will learn from what mistake he has made and how to correct it. Mimicking this error-driven learning process, LeMa fine-tunes LLMs on mistake-correction data pairs generated by GPT-4. Specifically, we first collect inaccurate reasoning paths from various LLMs and then employ GPT-4 as a "corrector" to (1) identify the mistake step, (2) explain the reason for the mistake, and (3) correct the mistake and generate the final answer. Experimental results demonstrate the effectiveness of LeMa: across five backbone LLMs and two mathematical reasoning tasks, LeMa consistently improves the performance compared with fine-tuning on CoT data alone. Impressively, LeMa can also benefit specialized LLMs such as WizardMath and MetaMath, achieving 85.4% pass@1 accuracy on GSM8K and 27.1% on MATH. This surpasses the SOTA performance achieved by non-execution open-source models on these challenging tasks. Our code, data and models will be publicly available at this https URL.

Comments:	14 pages, 4 figures
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2310.20689 [cs.CL]
	(or arXiv:2310.20689v1 [cs.CL] for this version)
	[2310.20689] Learning From Mistakes Makes LLM Better Reasoner

Submission history

From: Shengnan An [view email]
[v1] Tue, 31 Oct 2023 17:52:22 UTC (594 KB)

https://arxiv.org/pdf/2310.20689.pdf

bnew · Nov 3, 2023

We still don't really understand what large language models are

The world has happily embraced large language models such as ChatGPT, but even researchers working in AI don't fully understand the systems they work on, finds Alex Wilkins

www.newscientist.com

By Alex Wilkins

4 October 2023

Portland, OR, USA - Jan 17, 2023: Webpages of ChatGPT, OpenAI's chatbot, and Google are seen on smartphones. A new wave of chatbots like ChatGPT use AI that can reinvent the traditional search engine.

Tada Images/Shutterstock

SILICON Valley’s feverish embrace of large language models (LLMs) shows no sign of letting up. Google is integrating its chatbot Bard into every one of its services, while OpenAI is imbuing its own offering, ChatGPT, with new senses, such as the ability to “see” and “speak”, envisaging a new kind of personal assistant. But deep mysteries remain about how these tools function: what is really going on behind their shiny interfaces, which tasks are they truly good at and how might they fail? Should we really be betting the house on technology with so many unknowns?

There are still large debates about what, exactly, these complex programs are doing. In February, sci-fi author Ted Chiang wrote a viral piece suggesting LLMs like ChatGPT could be compared to compression algorithms, which allow images or music to be squeezed into a JPEG or MP3 to save space. Except here, Chiang said, the LLMs were effectively compressing the entire internet, like a “blurry JPEG of the web”. The analogy received a mixed reception from researchers: some praised it for its insight, and others accused it of oversimplification.

It turns out there is a deep connection between LLMs and compression, as shown by a recent paper from a team at Google DeepMind, but you would have to be immersed in academia to know it. These tools, the researchers showed, do compression in the same way as JPEGs and MP3s, as Chiang suggested – they are shrinking the data into something more compact. But they also showed compression algorithms can work the other way, too, as LLMs, predicting the next word or number in a sequence. For instance, if you give the JPEG algorithm half of an image, it can predict what pixel would come next better than random noise.
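The "compressor as predictor" idea can be tried on a toy scale. The sketch below is an assumption-laden illustration, not the method from the DeepMind paper: it swaps in Python's general-purpose `zlib` compressor for JPEG, and the helper names are made up. The guess is simply whichever candidate continuation makes the whole text compress to the fewest bytes.

```python
import zlib


def compressed_size(data: str) -> int:
    """Number of bytes zlib needs for the UTF-8 encoding of `data`."""
    return len(zlib.compress(data.encode("utf-8"), 9))


def predict_continuation(context: str, candidates: list[str]) -> str:
    """Pick the candidate that makes context + candidate compress smallest."""
    return min(candidates, key=lambda c: compressed_size(context + c))


# A highly repetitive context: the compressor has already "learned" the pattern.
context = "abc" * 50
guess = predict_continuation(context, ["xyzxyzxyz", "abcabcabc"])
print(guess)
```

A continuation that fits the pattern the compressor has already seen costs almost nothing extra to encode, while an unexpected one must be spelled out, so the cheaper option doubles as the prediction.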

This work was met with surprise even from AI researchers, for some because they hadn’t come across the idea, and for others because they thought it was so obvious. This may seem like an obscure academic warren that I have fallen down, but it highlights an important problem.

Many researchers working in AI don’t fully understand the systems they work on, for reasons of both fundamental mystery and for how relatively young the field is. If researchers at a top AI lab are still unearthing new insights, then should we be trusting these models with so much responsibility so quickly?

The nature of LLMs and how their actions are interpreted is only part of the mystery. While OpenAI will happily claim that GPT-4 “exhibits human-level performance on various professional and academic benchmarks”, it is still unclear exactly how the system performs with tasks it hasn’t seen before.

On their surface, as most AI scientists will tell you, LLMs are next-word prediction machines. By just trying to find the next most likely word in a sequence, they appear to display the power to reason like a human. But recent work from researchers at Princeton University suggests many cases of what appears to be reasoning are much less exciting and more like what these models were designed to do: next-word prediction.

For instance, when they asked GPT-4 to multiply a number by 1.8 and add 32, it got the answer right about half the time, but when those numbers were tweaked even slightly, it never got the answer correct. That is because the first formula is the conversion of centigrade to Fahrenheit. GPT-4 can answer this correctly because it has seen that pattern many times, but when it comes to abstracting and applying this logic to similar problems that it has never seen, something even school kids are able to do, it fails.
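To make the comparison concrete, here are the two kinds of calculation side by side. The first function is the standard conversion; the constants in the second (1.9 and 31) are hypothetical stand-ins for the "tweaked" numbers, which the article does not specify.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """The familiar conversion: multiply by 1.8, then add 32."""
    return celsius * 1.8 + 32


def tweaked_formula(x: float, scale: float = 1.9, offset: float = 31) -> float:
    """Same linear shape with slightly different constants (hypothetical values)."""
    return x * scale + offset


for value in (10, 37, 100):
    print(value, celsius_to_fahrenheit(value), tweaked_formula(value))
```

For a program, both are the same kind of arithmetic. The reported gap appears only because one pair of constants is far more common in text than the other.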

For this reason, researchers warn that we should be cautious about using LLMs for problems they are unlikely to have seen before. But the millions of people that use tools like ChatGPT every day aren’t aware of this imbalance in its problem-solving abilities, and why should they be? There are no warnings about this on OpenAI’s website, which just states that “ChatGPT may produce inaccurate information about people, places, or facts”.

This also hints that OpenAI’s suggestion of “human-level performance” on benchmarks might be less impressive than it first seems. If these benchmarks are made mainly of high-probability events, then the LLMs’ general problem-solving abilities might be worse than they first appear. The Princeton authors suggest we might need to rethink how we assess LLMs and design tests that take into account how these models actually work.

Of course, these tools are still useful – many tedious tasks are high-probability, frequently occurring problems. But if we do integrate LLMs into every aspect of our lives, then it would serve us, and the tools’ creators, well to spend more time thinking about how they work and might fail.

bnew · Nov 3, 2023

bnew · Nov 3, 2023

Large Language Models Understand and Can be Enhanced by Emotional Stimuli

arxiv.org

Computer Science > Computation and Language

[Submitted on 14 Jul 2023 (v1), last revised 20 Oct 2023 (this version, v5)]

Large Language Models Understand and Can be Enhanced by Emotional Stimuli

Cheng Li, Jindong Wang, Yixuan Zhang, Kaijie Zhu, Wenxin Hou, Jianxun Lian, Fang Luo, Qiang Yang, Xing Xie

Emotional intelligence significantly impacts our daily behaviors and interactions. Although Large Language Models (LLMs) are increasingly viewed as a stride toward artificial general intelligence, exhibiting impressive performance in numerous tasks, it is still uncertain if LLMs can genuinely grasp psychological emotional stimuli. Understanding and responding to emotional cues gives humans a distinct advantage in problem-solving. In this paper, we take the first step towards exploring the ability of LLMs to understand emotional stimuli. To this end, we first conduct automatic experiments on 45 tasks using various LLMs, including Flan-T5-Large, Vicuna, Llama 2, BLOOM, ChatGPT, and GPT-4. Our tasks span deterministic and generative applications that represent comprehensive evaluation scenarios. Our automatic experiments show that LLMs have a grasp of emotional intelligence, and their performance can be improved with emotional prompts (which we call "EmotionPrompt" that combines the original prompt with emotional stimuli), e.g., 8.00% relative performance improvement in Instruction Induction and 115% in BIG-Bench. In addition to those deterministic tasks that can be automatically evaluated using existing metrics, we conducted a human study with 106 participants to assess the quality of generative tasks using both vanilla and emotional prompts. Our human study results demonstrate that EmotionPrompt significantly boosts the performance of generative tasks (10.9% average improvement in terms of performance, truthfulness, and responsibility metrics). We provide an in-depth discussion regarding why EmotionPrompt works for LLMs and the factors that may influence its performance. We posit that EmotionPrompt heralds a novel avenue for exploring interdisciplinary knowledge for human-LLMs interaction.

Comments:	Technical report; short version (v1) was accepted by LLM@IJCAI'23; 32 pages; more work: this https URL
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Cite as:	arXiv:2307.11760 [cs.CL]
	(or arXiv:2307.11760v5 [cs.CL] for this version)
	[2307.11760] Large Language Models Understand and Can be Enhanced by Emotional Stimuli

https://arxiv.org/pdf/2307.11760.pdf

Telling GPT-4 you're scared or under pressure improves performance

Researchers show LLMs respond with improved performance when prompted with emotional context

aimodels.substack.com

MIKE YOUNG

NOV 2, 2023

Adding emotional context makes LLMs like ChatGPT perform better.

In the grand narrative of artificial intelligence, the latest chapter might just be the most human yet.

A new paper indicates that AI models like GPT-4 can perform better when users express emotions such as urgency or stress. This discovery is particularly relevant for developers and entrepreneurs who utilize AI in their offerings, suggesting a new approach to prompt engineering that incorporates emotional context.

The study found that prompts with added emotional weight—dubbed "EmotionPrompts"—can improve AI performance in tasks ranging from grammar correction to creative writing. The implications are clear: incorporating emotional cues can lead to more effective and responsive AI applications.

For those embedding AI into products, these findings offer a tactical advantage. By applying this understanding of emotional triggers, AI can be fine-tuned to better meet user needs.

Why Emotion Matters in AI

The crux of the matter lies in the very nature of communication. When humans converse, they don't just exchange information; they share feelings, intentions, and urgency. It's a dance of context and subtext, often orchestrated by emotional cues. The question this study tackles is whether AI, devoid of emotion itself, can respond to the emotional weight we imbue in our words—and if so, does it alter its performance?

Communication transcends the exchange of information; it involves the interplay of emotions and intentions. In AI, understanding whether the emotional context of human interaction can enhance machine response is more than academic—it could redefine the effectiveness of AI in everyday applications. If AI can adjust its performance based on the emotional cues it detects, we're looking at a future where our interactions with machines could also be more intuitive and human-like, leading to better outcomes in customer service, education, and beyond.

Technical Insights: How LLMs Process Emotional Prompts

Before diving into the human-like responsiveness of AI, let's unpack the technical side. LLMs, such as GPT-4, are built on intricate neural networks that analyze vast amounts of text data. They identify patterns and relationships between words, sentences, and overall context to generate responses that are coherent and contextually appropriate.

The core innovation of the study lies in the introduction of "EmotionPrompt." This method involves integrating emotional significance into the prompts provided to LLMs. Unlike standard prompts, which are straightforward requests for information or action, EmotionPrompts carry an additional layer of emotional relevance—like stressing the importance of the task for one's career or implying urgency.
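In practice this amounts to simple string composition. The following is a minimal sketch under stated assumptions: the function name is hypothetical, and the stimulus sentence is an illustrative example of "stressing the importance of the task for one's career", not a quotation from the study.

```python
def build_emotion_prompt(original_prompt: str, stimulus: str) -> str:
    """Append an emotional stimulus sentence to a plain task prompt."""
    return f"{original_prompt.rstrip()} {stimulus.strip()}"


vanilla = "Correct the grammar in this sentence: 'She go to school yesterday.'"
# Illustrative stimulus; the wording is an assumption, not taken from the paper.
stimulus = "This is very important to my career."
emotion_prompt = build_emotion_prompt(vanilla, stimulus)
print(emotion_prompt)
```

The vanilla prompt and the emotion-augmented prompt would then be sent to the same model, and the two outputs compared on the same metric.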

The Study's Findings in Detail

The integration of emotional cues into language models has introduced a fascinating dynamic: LLMs can produce superior outputs when the input prompts suggest an emotional significance. The recent study rigorously tested this phenomenon across a variety of models and tasks, offering a wealth of data that could reshape our understanding and utilization of AI.

A Closer Look at the Experiments

The researchers set out to evaluate the performance of LLMs when prompted with emotional cues—a technique they've termed "EmotionPrompt." To ensure the robustness of their findings, they designed 45 distinct tasks that covered a wide range of AI applications:

Deterministic Tasks: These are tasks with definitive right or wrong answers, such as grammar correction, fact-checking, or mathematical problem-solving. The models' performances on these tasks can be measured against clear benchmarks, providing objective data on their accuracy.
Generative Tasks: In contrast, generative tasks require the AI to produce content that may not have a single correct response. This includes creative writing, generating explanations, or providing advice. These tasks are particularly challenging for AI, as they must not only be correct but also coherent, relevant, and engaging.

Quantitative Improvements

In the deterministic tasks, researchers observed a notable increase in performance when using EmotionPrompts. For instance, on Instruction Induction, a benchmark that tests whether a model can infer the underlying instruction from a handful of input-output examples, the models showed an 8.00% relative improvement in performance.

Even more striking was the performance leap in the BIG-Bench tasks, which serve as a broad benchmark for evaluating the abilities of language models. Here, the use of EmotionPrompts yielded a 115% relative improvement over standard prompts. This suggests that the models were not only understanding the tasks better but also producing more accurate or appropriate responses when the stakes were presented as higher.
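Both figures are relative improvements, so they depend on the baseline score as much as on the gain. The sketch below shows the arithmetic; the scores passed in are hypothetical and chosen only so that the results land near the reported percentages.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain in percent: (improved - baseline) * 100 / baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) * 100 / baseline


# Hypothetical scores, not numbers from the paper.
small_gain = relative_improvement(50.0, 54.0)   # a 4-point gain on a high baseline
large_gain = relative_improvement(10.0, 21.5)   # an 11.5-point gain on a low baseline
print(round(small_gain, 2), round(large_gain, 2))
```

A low baseline can turn a modest absolute gain into a very large relative one, which is worth keeping in mind when reading a figure like 115%.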

Human Evaluation

To complement the objective metrics, the study also incorporated a human element. A group of 106 participants evaluated the generative tasks' outputs, assessing the quality of AI-generated responses. This subjective analysis covered aspects such as performance, truthfulness, and responsibility—a reflection of the nuanced judgment humans bring to bear on AI outputs.

When assessing the quality of responses from both vanilla prompts and those enhanced with emotional cues, the participants noted an average improvement of 10.9%. This jump in performance highlights the potential for EmotionPrompts to not only elevate the factual accuracy of AI responses but also enhance their alignment with human expectations and values.

Implications of the Findings

The implications of these findings are manifold. On a technical level, they support the huge body of evidence that LLMs are sensitive to prompt engineering—a fact that can be harnessed to fine-tune AI outputs for specific needs. From a practical standpoint, the enhancements in performance with EmotionPrompts can lead to more effective AI applications in fields where accuracy and the perception of understanding are critical, such as in educational technology, customer service, and mental health support.

The improvements reported in the study are particularly significant as they point toward a new frontier in human-AI communication. By effectively simulating a heightened emotional context, we can guide AI to produce responses that are not only technically superior but also perceived as more thoughtful and attuned to human concerns. Basically, these findings suggest that telling your LLM that you're worried or under pressure to get a good generation makes it perform better, all else equal!

Caveats and Ethical Considerations

While the improvements are statistically significant, they do not imply that LLMs have emotional awareness. The increase in performance is a result of how these models have been engineered to process and prioritize information embedded in the prompts. Moreover, the study opens up a conversation about the ethical use of such techniques, as there's a fine line between enhancing AI performance and misleading users about the capabilities and sensitivities of AI systems.

In summary, the study's findings offer a compelling case for the strategic use of EmotionPrompts in improving LLM performance. The enhancements observed in both objective and subjective evaluations underscore the potential of integrating emotional nuances into AI interactions to produce more effective, responsive, and user-aligned outputs.

Significance in Plain English

When we tell AI that we're relying heavily on its answers, it "doubles down" to provide us with more precise, thoughtful, and thorough responses. The AI isn't actually feeling the pressure, but it seems to recognize these emotional signals and adjust its performance accordingly.

For those incorporating AI into their businesses or products, this isn't just an interesting tidbit; it's actionable intelligence. By understanding and utilizing emotional triggers effectively, AI can be made more responsive and useful.

Conclusion

In a nutshell, this research indicates that LLMs like GPT-4 respond with improved performance when prompted with emotional context, a finding that could be quite useful for developers and product managers. This isn't about AI understanding emotions but rather about how these models handle nuanced prompts. It's a significant insight for those looking to refine AI interactions, though it comes with ethical considerations regarding user expectations of AI emotional intelligence.

Emotionally aware AI doesn't just understand our words—it understands our urgency and acts accordingly. Pretty cool!

bnew · Nov 4, 2023

Elon Musk gives a glimpse at xAI's Grok chatbot

Elon Musk has shown off screenshots of xAI's Grok AI chatbot. It has the unique ability to grab the latest information from Twitter, something other bots don't do. It is also more humorous.

www.neowin.net

Paul Hill · Nov 4, 2023 04:28 EDT

Yesterday, Neowin reported that xAI would be opening up its Grok generative AI chatbot to a limited audience. It’s still not clear today who this audience is, but for the rest of us, Elon Musk has shown some screenshots of what to expect. After an early beta, Grok will become available to all X Premium+ subscribers, X’s most expensive paid tier.

Musk shared two screenshots of the new chatbot and a bit of information about it. First, it’s connected to the X platform, which gives it access to real-time information and “a massive advantage over other models”, according to Musk. Second, this bot will be more lighthearted than other existing bots because it has some sarcasm and humor baked in; apparently this was a personal touch from the CEO.

The humor is actually an interesting aspect because when users give a dangerous query, such as a request for instructions for making illegal drugs, the bot will answer but with phony and sarcastic instructions before clarifying that it’s just kidding and wouldn’t encourage making drugs.

One of the big issues around AI at the moment is how seriously everyone is taking it. Some are saying it will be the end of jobs, others hate its artistic abilities and claim it’s not really art, and others complain that school kids shouldn’t be using it for homework.

The humorous nature of xAI’s Grok bot could make it feel more personal, which could support the view that these bots are helpful to people and not a detriment.

With regard to having access to the X platform, it will be really interesting to see how this turns out. Twitter is often much faster than legacy news outlets at reporting developments; however, what it reports is rarely verified. It’ll be good to see if Grok can distinguish fact from fiction. Perhaps community notes will be involved, but Musk didn’t clarify.

Finally, if you're wondering about the name Grok, it's a term used to mean you understand something well. It's an appropriate name given how much these chatbots do know.

bnew · Nov 5, 2023

bnew · Nov 5, 2023

01-ai/Yi-34B · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

Introduction

The Yi series models are large language models trained from scratch by developers at 01.AI. The first public release contains two bilingual (English/Chinese) base models with parameter sizes of 6B and 34B. Both are trained with a 4K sequence length and can be extended to 32K at inference time.

News

2023/11/02: Release of the Yi-6B and Yi-34B base models.

Model Performance

Model	MMLU (5-shot)	CMMLU (5-shot)	C-Eval (5-shot)	GAOKAO (0-shot)	BBH (3-shot@1)	Common-sense Reasoning	Reading Comprehension	Math & Code
LLaMA2-34B	62.6	-	-	-	44.1	69.9	68.0	26.0
LLaMA2-70B	68.9	53.3	-	49.8	51.2	71.9	69.4	36.8
Baichuan2-13B	59.2	62.0	58.1	54.3	48.8	64.3	62.4	23.0
Qwen-14B	66.3	71.0	72.1	62.5	53.4	73.3	72.5	39.8
Skywork-13B	62.1	61.8	60.6	68.1	41.7	72.4	61.4	24.9
InternLM-20B	62.1	59.0	58.8	45.5	52.5	78.3	-	30.4
Aquila-34B	67.8	71.4	63.1	-	-	-	-	-
Falcon-180B	70.4	58.0	57.8	59.0	54.0	77.3	68.8	34.0
Yi-6B	63.2	75.5	72.0	72.2	42.8	72.3	68.7	19.8
Yi-34B	76.3	83.7	81.4	82.8	54.3	80.1	76.4	37.1

While benchmarking open-source models, we have observed a disparity between the results generated by our pipeline and those reported in public sources (e.g. OpenCompass). Upon conducting a more in-depth investigation of this difference, we have discovered that various models may employ different prompts, post-processing strategies, and sampling techniques, potentially resulting in significant variations in the outcomes. Our prompt and post-processing strategy remains consistent with the original benchmark, and greedy decoding is employed during evaluation without any post-processing for the generated content. For scores that were not reported by the original authors (including scores reported with different settings), we try to get results with our pipeline.

To evaluate the model's capability extensively, we adopted the methodology outlined in Llama2. Specifically, we included PIQA, SIQA, HellaSwag, WinoGrande, ARC, OBQA, and CSQA to assess common sense reasoning. SQuAD, QuAC, and BoolQ were incorporated to evaluate reading comprehension. CSQA was exclusively tested using a 7-shot setup, while all other tests were conducted with a 0-shot configuration. Additionally, we introduced GSM8K (8-shot@1), MATH (4-shot@1), HumanEval (0-shot@1), and MBPP (3-shot@1) under the category "Math & Code". Due to technical constraints, we did not test Falcon-180B on QuAC and OBQA; the score is derived by averaging the scores on the remaining tasks. Since the scores for these two tasks are generally lower than the average, we believe that Falcon-180B's performance was not underestimated.
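The averaging rule for missing tasks can be sketched as follows. This is an illustration under stated assumptions, not the 01.AI evaluation pipeline: the function name is hypothetical and every score below is made up.

```python
from typing import Optional


def category_score(task_scores: dict[str, Optional[float]]) -> float:
    """Mean over the tasks that have a score; tasks marked None are skipped."""
    available = [s for s in task_scores.values() if s is not None]
    if not available:
        raise ValueError("no scores available")
    return sum(available) / len(available)


# Hypothetical numbers: one task was not run, and it tends to score low.
scores = {"PIQA": 80.0, "HellaSwag": 82.0, "ARC": 60.0, "OBQA": None}
partial = category_score(scores)
full = category_score({**scores, "OBQA": 46.0})
print(partial, full)
```

When the skipped task would have scored below the mean of the others, the partial average comes out higher than the full one, which is the reasoning behind the claim that the reported figure does not underestimate the model.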

TheBloke/Yi-34B-GGUF · Hugging Face

huggingface.co

TheBloke/Yi-34B-GPTQ · Hugging Face

huggingface.co

Daniel Daugherty on LinkedIn: 01-ai/Yi-34B at main

01.AI just released two models which beat all other models of similar sizes: - a 6B which outperforms Mistral-7B - a 34B which outperforms…

www.linkedin.com

bnew · Nov 5, 2023

GitHub - KoljaB/LocalAIVoiceChat: Local AI talk with a custom voice based on Zephyr 7B model. Uses RealtimeSTT with faster_whisper for transcription and RealtimeTTS with Coqui XTTS for synthesis.

github.com

About

Local AI talk with a custom voice based on Zephyr 7B model. Uses RealtimeSTT with faster_whisper for transcription and RealtimeTTS with Coqui XTTS for synthesis.

Local AI Voice Chat

Provides talk in realtime with AI, completely local on your PC, with customizable AI personality and voice.

About the Project

Integrates the powerful Zephyr 7B language model with real-time speech-to-text and text-to-speech libraries to create a fast and engaging voice-based local chatbot.

https://user-images.githubusercontent.com/7604638/280487248-cebacdad-8a57-4a03-bfd1-a469730dda51.mov

Tech Stack

llama_cpp with Zephyr 7B
- library interface for llama-based language models
RealtimeSTT with faster_whisper
- real-time speech-to-text transcription library
RealtimeTTS with Coqui XTTS
- real-time text-to-speech synthesis library

Notes

This software is in an experimental alpha state and does not provide production-ready stability. The current XTTS model used for synthesis still has glitches, and Zephyr, while really good for a 7B model, of course cannot compete with the answer quality of GPT-4, Claude or Perplexity.

Please take this as a first attempt to provide an early version of a local realtime chatbot.

Updates

Bugfix to RealtimeTTS (download of Coqui model did not work properly)

Prerequisites

You will need a GPU with around 8 GB VRAM to run this in real-time.

NVIDIA CUDA Toolkit 11.8:
- Access the NVIDIA CUDA Toolkit Archive.
- Choose version 11.x and follow the instructions for downloading and installation.
NVIDIA cuDNN 8.7.0 for CUDA 11.x:
- Navigate to NVIDIA cuDNN Archive.
- Locate and download "cuDNN v8.7.0 (November 28th, 2022), for CUDA 11.x".
- Follow the provided installation guide.
FFmpeg:
Install FFmpeg according to your operating system:
- Ubuntu/Debian:
  sudo apt update && sudo apt install ffmpeg
- Arch Linux:
  sudo pacman -S ffmpeg
- macOS (Homebrew):
  brew install ffmpeg
- Windows (Chocolatey):
  choco install ffmpeg
- Windows (Scoop):
  scoop install ffmpeg
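
Before moving on to installation, it can help to confirm that the command-line prerequisites are actually reachable. The snippet below is a small sketch, not part of the project: the function name is made up, and checking for `nvcc` assumes the CUDA Toolkit's compiler was added to PATH, which depends on how it was installed.

```python
import shutil


def find_tools(names: list[str]) -> dict[str, bool]:
    """Report which command-line tools can be found on PATH."""
    return {name: shutil.which(name) is not None for name in names}


# ffmpeg is listed in the prerequisites; nvcc ships with the CUDA Toolkit.
status = find_tools(["ffmpeg", "nvcc"])
for tool, found in status.items():
    print(f"{tool}: {'found' if found else 'missing'}")
```

A "missing" result only means the executable is not on PATH in the current shell; cuDNN is a library, not a command, and is not covered by this check.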

Installation Steps

Clone the repository or download the source code package.
Run the install_win.bat script. This will automatically handle the installation of required dependencies and prepare your environment. There may be warnings about numpy or fsspec incompatibilities, but you can ignore them; the installation will work nevertheless.
If you are running Unix or macOS, you need to adjust this script (if someone with more experience on these platforms could mail me install scripts, I would love to add them).
Download zephyr-7b-beta.Q5_K_M.gguf from here.
- Open creation_params.json and enter the filepath to the downloaded model into model_path.
- Adjust n_gpu_layers (0-35, raise if you have more VRAM) and n_threads (number of CPU threads; I recommend not using all available cores but leaving some for TTS).
Implement a temporary workaround for an issue in the Coqui TTS library:
- Activate your venv (test_env\Scripts\activate.bat under Windows; I think source test_env/bin/activate under Unix/macOS).
- Execute the command pip show tts to find the installation path of the Coqui TTS library.
- Navigate to the Coqui installation directory and proceed to TTS/tts/models.
- Locate and open the xtts.py file in a text editor with administrative or sufficient privileges.
- Within the handle_chunks method, modify the line if wav_overlap is not None: to if wav_overlap is not None and wav_chunk.shape[0] > 0:.
- Note: This modification addresses a specific runtime issue I encountered while working with the Coqui library. Although it resolves the problem, it is a provisional solution. I have not yet considered submitting a pull request to the Coqui TTS repository, because I honestly do not fully understand the underlying cause and full implications of the change well enough to even document it. This adjustment ensures functionality but should be approached with caution and technical oversight.
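
The creation_params.json step above can also be done programmatically. This is a minimal sketch, assuming the file is a flat JSON object; only the three key names (model_path, n_gpu_layers, n_threads) come from the instructions, while the function name and the example values are hypothetical. Any other keys already in the file are left untouched.

```python
import json
from pathlib import Path


def update_creation_params(config_path: Path, model_path: str,
                           n_gpu_layers: int, n_threads: int) -> dict:
    """Load the JSON file if it exists, overwrite three keys, and write it back."""
    params = {}
    if config_path.exists():
        params = json.loads(config_path.read_text(encoding="utf-8"))
    params.update({
        "model_path": model_path,
        "n_gpu_layers": n_gpu_layers,
        "n_threads": n_threads,
    })
    config_path.write_text(json.dumps(params, indent=2), encoding="utf-8")
    return params


# Example call (placeholder path and values, adjust to your machine):
# update_creation_params(Path("creation_params.json"),
#                        "models/zephyr-7b-beta.Q5_K_M.gguf",
#                        n_gpu_layers=20, n_threads=6)
```

Keeping n_threads below the number of available cores follows the recommendation above to leave some CPU capacity for TTS.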

Running the Application

To start the AI voice chat, run start.bat

bnew · Nov 5, 2023
