In this article, authors Numa Dhamani and Maggie Engler discuss how prompt engineering techniques can help you use large language models (LLMs) more effectively to achieve better results. Prompting techniques discussed include few-shot, chain-of-thought, self-consistency, and tree-of-thoughts...
Maximizing the Utility of Large Language Models (LLMs) through Prompting
JAN 19, 2024 · 14 MIN READ
by
Numa Dhamani
Principal Machine Learning Engineer @Kungfu
Maggie Engler
Engineer & Researcher on LLMs @Inflection
Key Takeaways
- Prompt engineering is about experimenting with changes in prompts to understand their impacts on what large language models (LLMs) generate as the output. Prompt engineering yields better outcomes for LLM use with a few basic techniques
- Zero-shot prompting is when an LLM is given a task, via prompt, for which the model has not previously seen data
- For many language tasks in the literature, performance improves when the prompt includes a few examples; this is known as few-shot prompting
- Chain-of-Thought (CoT) prompting breaks down multi-step problems into intermediate steps, allowing LLMs to tackle complex reasoning that can't be solved with zero-shot or few-shot prompting
- Built upon CoT, self-consistency prompting is an advanced technique that samples multiple, diverse reasoning paths from the LLM and then selects the most consistent answer among the generated responses
A new job title, "prompt engineer," has made waves in tech media recently, bursting onto the careers pages of top AI companies with promises of eye-wateringly high salaries. But what even is prompt engineering? The term itself was coined only within the past few years and refers to the art and science of prompting large language models (LLMs) to achieve better results. These LLMs, which include the popular ChatGPT and GPT-4 from OpenAI, Claude from Anthropic, and Bard from Google, can be used for all types of language tasks, from drafting emails and documents to summarizing or classifying text.
Prompting, essentially, refers to the practice of giving inputs to the LLMs, typically a set of instructions describing how to perform the desired task. For example, to use ChatGPT to draft a cover letter based on a job description, one might use the prompt,
"You are an applicant applying for the following job. Write a cover letter explaining why you are well-suited for the role."
Easy enough, right? Here's where it gets tricky: researchers have found that the results provided by LLMs depend heavily on the specific prompt given, and so even though it might seem trivial to explain a well-defined task like writing a cover letter, simple adjustments like wording and formatting changes can substantially affect the output received.
Prompt engineering is fundamentally experimenting with changes in prompts to understand their impacts on what the model generates, and therefore doesn't require an advanced technical background, only a little curiosity and creativity.
Furthermore, prompt engineering is not just the purview of highly paid programmers. We argue that everyone who uses LLMs can, and should, be a prompt engineer. The most basic reason is that prompt engineering will yield better outcomes for LLM use, and with even a few basic techniques, it's possible to measurably improve performance on many common tasks.
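To make this concrete, here is a minimal sketch of what sending a prompt to an LLM looks like in code. It assumes the OpenAI Python SDK purely for illustration; the model name and prompt wording are placeholders, and any chat-style LLM API works the same way.

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# A prompt is just text: instructions plus any supporting context.
prompt = (
    "You are an applicant applying for the following job. "
    "Write a cover letter explaining why you are well-suited for the role.\n\n"
    "Job description: ..."  # paste the actual job description here
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative; use whichever model you have access to
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```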
Few-Shot Prompting
Let's consider a task for which someone might want to employ an LLM: classifying customer reviews by sentiment. When an LLM is given a task, via prompt, for which the model has not previously seen data, this is referred to as
zero-shot prompting. In this case, the task would be a zero-shot classification task. The zero-shot prompt might look something like the example below, where the user gives Google's Bard instructions for classifying a movie review.
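In sketch form, where the review text and exact wording are our own illustration:

```python
# A zero-shot prompt: instructions only, no labeled examples.
review = "An absolutely delightful film with stellar performances."  # illustrative
prompt = f"""Classify the movie review below as POSITIVE or NEGATIVE.

Review: {review}
Sentiment:"""

print(prompt)  # send this to the model as in the earlier sketch
```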
This might work reasonably well on its own, but across a variety of language tasks in the literature, performance nearly always improves with a few examples, or a
few-shot prompt. In this case, the few-shot prompt might start the same way, but append several labeled examples of the expected output, as in the sketch below.
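Here the labeled reviews are illustrative stand-ins for the ones in the original example:

```python
# A two-shot prompt: the same instructions, plus two labeled examples
# before the final, unlabeled review.
prompt = """Classify each movie review as POSITIVE or NEGATIVE.

Review: A tedious, overlong mess with no redeeming qualities.
Sentiment: NEGATIVE

Review: Sharp writing and a career-best lead performance.
Sentiment: POSITIVE

Review: An absolutely delightful film with stellar performances.
Sentiment:"""

print(prompt)
```

Again, we see that the model generates a response of "POSITIVE" for the final unlabeled review.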
More precisely, this prompt would be two-shot, referring to the fact that there are two examples provided to the model. The fact that providing examples in this way measurably improves the model's ability to complete the task is unique to LLMs; as a result, unlike other machine learning models, LLMs are able to perform a wide variety of tasks that they are not trained to do.
However, every approach comes with its own limitations. While few-shot prompting is an effective technique for numerous tasks, it tends to fall short when addressing more complex reasoning tasks. Let’s look at an
example. Below, we provided ChatGPT with a multi-step reasoning task: do the odd numbers in the given group add up to an even number?
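A few-shot prompt for this task might look like the following sketch; the groups of numbers are illustrative.

```python
# A few-shot prompt for the arithmetic task: the labeled examples give
# the final answer only, with no intermediate reasoning.
prompt = """The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: False

The odd numbers in this group add up to an even number: 17, 10, 19, 4, 8, 12, 24.
A: True

The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
A:"""

print(prompt)  # models frequently answer this incorrectly
```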
Even though we used few-shot prompting to provide the LLM with multiple examples, it did not give us the correct answer. To solve this task, we need to first determine the odd numbers. Then, we need to add the odd numbers together, and finally, we need to determine if the resulting number is odd or even. That is a
multi-step reasoning task, so it might help if we break the problem down into multiple steps and demonstrate it to the model.
Chain-of-Thought Prompting
This is where
Chain-of-Thought (CoT) prompting comes in. Introduced by Google researchers, CoT prompting breaks down multi-step problems into intermediate steps, allowing LLMs to tackle complex reasoning that can’t be solved with zero-shot or few-shot prompting. Here, instead of directly asking the LLM to solve the problem, we would use CoT prompting to guide the model through the intermediate reasoning steps. Let’s revisit the previous
example.
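A CoT version of the prompt spells out the intermediate reasoning in the example itself; as before, the numbers are illustrative.

```python
# A chain-of-thought prompt: the single example now demonstrates the
# intermediate reasoning steps, not just the final answer.
prompt = """The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: The odd numbers are 9, 15, and 1. Adding them gives 9 + 15 + 1 = 25. 25 is odd, so the answer is False.

The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
A:"""

print(prompt)  # the model now works through the steps: 15 + 5 + 13 + 7 + 1 = 41, so False
```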
By providing the intermediate reasoning step, ChatGPT is able to get the correct answer with only one example. This technique can be extremely beneficial for complex multi-step tasks: if the target task involves multiple steps or sequential reasoning, CoT can help break the problem down into smaller, more manageable pieces. A more recent idea is
zero-shot CoT, where you ask the LLM to think through the intermediate reasoning tasks itself by adding "let’s think step by step" to the original prompt. Let’s demonstrate this with our example again.
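In sketch form, zero-shot CoT is just the question plus that one phrase:

```python
# Zero-shot CoT: no worked examples, only an instruction to reason step by step.
prompt = """The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
Let's think step by step."""

print(prompt)  # the model lists the odd numbers, sums them to 41, and answers False
```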
Without any examples, the LLM is able to correctly perform the multi-step reasoning task with a simple prompt, which it was previously unable to do with few-shot prompting. CoT prompting is a useful technique for multi-step reasoning problems, tasks that benefit from intermediate explanations, or simply tasks for which standard prompting techniques are insufficient.
Self-Consistency Prompting
Built upon CoT,
self-consistency prompting is an advanced prompting technique, also
introduced by Google researchers. This approach samples multiple, diverse reasoning paths from the LLM and then selects the most consistent answer among the generated responses. Self-consistency prompting is more powerful when used in conjunction with CoT, boosting its performance on complex reasoning tasks. Let’s look at an example from the
research paper demonstrating how the self-consistency approach works. First, you prompt the LLM using CoT prompting. Then, you generate a diverse set of responses, and finally, you select the most consistent response as your answer. By utilizing a majority voting system, the model is more likely to arrive at an accurate and reliable answer.
The three steps in the self-consistency method: 1) prompt the language model using CoT prompting, 2) sample a diverse set of reasoning paths from the language model, and 3) marginalize out reasoning paths to aggregate final answers and choose the most consistent answer. (source)
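In code, self-consistency amounts to sampling several completions at a nonzero temperature and taking a majority vote over the final answers. The sketch below again assumes the OpenAI Python SDK; the extract_answer helper is a hypothetical placeholder for whatever answer-parsing your task needs.

```python
from collections import Counter

from openai import OpenAI

client = OpenAI()

cot_prompt = """The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
Let's think step by step."""

# Steps 1 and 2: sample a diverse set of reasoning paths.
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative model choice
    messages=[{"role": "user", "content": cot_prompt}],
    temperature=0.7,  # diversity requires sampling, not greedy decoding
    n=5,              # five independent reasoning paths
)

def extract_answer(text: str) -> str:
    """Hypothetical parser: grab the last True/False verdict in a response."""
    verdicts = [w for w in text.replace(".", " ").split() if w in ("True", "False")]
    return verdicts[-1] if verdicts else "UNKNOWN"

# Step 3: marginalize out the reasoning and keep the most common final answer.
answers = [extract_answer(choice.message.content) for choice in response.choices]
print(answers, "->", Counter(answers).most_common(1)[0][0])
```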
Let’s look at another example of analyzing emails. Suppose that a software consulting company receives hundreds of emails every day, but often, most of them are promotional content or spam. They want to classify emails as IMPORTANT or NOT IMPORTANT.
Here is an example email that the company might receive:
Hello!
I trust this email finds you well. We are currently facing a critical challenge within our organization and believe your expertise could provide valuable insights. Given your successful track record in software engineering, we believe your team's insights and recommendations could be instrumental in resolving this matter.
Could we please schedule a meeting at your earliest convenience to discuss the details and explore how we might collaborate to address this challenge?
Your prompt attention to this matter is greatly appreciated.
Best regards,
John Doe
Now, we can put this email into a prompt and ask the LLM to "Classify the above email as IMPORTANT or NOT IMPORTANT as it relates to a software consulting company. Let’s think step by step." Then, we can use this prompt to generate multiple responses and then select the most consistent response.
RESPONSE: The email explicitly mentions facing a critical challenge within the organization, which signifies the importance of the matter. I would classify this email as IMPORTANT.
RESPONSE: The email mentions facing a critical challenge within the organization, but they have not described the issue or challenge. I would classify this email as NOT IMPORTANT.
RESPONSE: While not explicitly stated, the tone of the email and the request for prompt attention suggest a sense of urgency, emphasizing the importance of a timely response. I would classify this email as IMPORTANT.
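Tallying the verdicts is then a simple majority vote. One detail worth a comment: "NOT IMPORTANT" contains "IMPORTANT" as a substring, so the parsing order matters. The responses below are abbreviated from the three above.

```python
from collections import Counter

# The three sampled responses from above, abbreviated to their verdicts.
responses = [
    "... I would classify this email as IMPORTANT.",
    "... I would classify this email as NOT IMPORTANT.",
    "... I would classify this email as IMPORTANT.",
]

def verdict(text: str) -> str:
    # Check "NOT IMPORTANT" first, since it contains "IMPORTANT" as a substring.
    return "NOT IMPORTANT" if "NOT IMPORTANT" in text else "IMPORTANT"

votes = Counter(verdict(r) for r in responses)
print(votes.most_common(1)[0][0])  # -> IMPORTANT (two votes to one)
```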
By exploring multiple reasoning paths, the LLM is more likely to arrive at a correct and well-informed response. Additionally, self-consistency could potentially help counteract biases inherent in the model or the data it was trained on. Although self-consistency prompting has
shown an improvement in results when CoT alone has been ineffective, it falls short with tasks that require exploration or strategic lookahead.