Prompt writing guide (May 2023)

Introduction

This is a collection of recommendations, best practices and mental models to keep in mind when writing prompts for LLMs, as well as when designing systems built on top of LLMs.

In most cases, only some of the recommendations from each source are included. Typically this is done for one of the following reasons:

  • Only part of the source relates to actual prompt writing and LLMs
  • In our experience, the other recommendations from the source are not as good or as relevant for our use case.

Writing Prompts

One of the good guides on the topic is https://www.promptingguide.ai/. Some of its recommendations are repeated below, but the guide is pretty short and it is recommended to read it in full. Also worth reading are OpenAI’s guide, Best practices for prompt engineering with OpenAI API, and Microsoft’s guide: https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/advanced-prompt-engineering

Be specific, descriptive and as detailed as possible about the desired context, outcome, length, format, style, etc.

Be very specific about the instruction and task you want the model to perform. The more descriptive and detailed the prompt is, the better the results. This is particularly important when you have a desired outcome or style of generation you are seeking. There aren’t specific tokens or keywords that lead to better results. It’s more important to have a good format and descriptive prompt. In fact, providing examples in the prompt is very effective to get desired output in specific formats.

(from https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api, https://www.promptingguide.ai/introduction/tips)

Articulate the desired output format through examples.

Less effective ❌:

Extract the entities mentioned in the text below. Extract the following 4 entity types: company names, people names, specific topics and themes.

Text: {text}

Show, and tell – the models respond better when shown specific format requirements. This also makes it easier to programmatically parse out multiple outputs reliably.

Better ✅:

Extract the important entities mentioned in the text below. First extract all company names, then extract all people names, then extract specific topics which fit the content and finally extract general overarching themes

Desired format:
Company names: <comma_separated_list_of_company_names>
People names: -||-
Specific topics: -||-
General themes: -||-

Text: {text}

(from https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api)
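
Because the better prompt pins down an exact output format, the model’s reply can also be parsed programmatically. Below is a minimal Python sketch of such a parser; the field names come from the example above, but the parsing code and the sample response are our own illustration, not something from the source.

import re

def parse_entities(response: str) -> dict:
    """Parse lines of the form 'Field name: comma, separated, values' into a dict of lists."""
    fields = ["Company names", "People names", "Specific topics", "General themes"]
    parsed = {}
    for field in fields:
        match = re.search(rf"^{re.escape(field)}:\s*(.*)$", response, flags=re.MULTILINE)
        values = match.group(1) if match else ""
        parsed[field] = [item.strip() for item in values.split(",") if item.strip()]
    return parsed

example_response = (
    "Company names: Stripe, OpenAI\n"
    "People names:\n"
    "Specific topics: payment processing, language models\n"
    "General themes: developer tools"
)
print(parse_entities(example_response))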

Focus on what to do, not what not to do.

Less effective ❌:

The following is a conversation between an Agent and a Customer. DO NOT ASK USERNAME OR PASSWORD. DO NOT REPEAT.

Customer: I can’t log in to my account.
Agent:

Better ✅:

The following is a conversation between an Agent and a Customer. The agent will attempt to diagnose the problem and suggest a solution, whilst refraining from asking any questions related to PII. Instead of asking for PII, such as username or password, refer the user to the help article www.samplewebsite.com/help/faq

Customer: I can’t log in to my account.
Agent:

(from https://www.promptingguide.ai/introduction/tips, https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api)

Put instructions at the beginning of the prompt and use ###, “””, or other delimiters to separate the instruction, context and any other logical parts of the prompt.

Less effective ❌:

Summarize the text below as a bullet point list of the most important points.

{text input here}

Better ✅:

Summarize the text below as a bullet point list of the most important points.

Text: """
{text input here}
"""

(From https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api)

Reduce “fluffy” and imprecise descriptions

Less effective ❌:

The description for this product should be fairly short, a few sentences only, and not too much more.

Better ✅:

Use a 3 to 5 sentence paragraph to describe this product.

(from https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api)

Use “leading words” to nudge the model toward a particular pattern / Prime the output

This refers to including a few words or phrases at the end of the prompt to obtain a model response that follows the desired form. For example, using a cue such as “Here’s a bulleted list of key points:\n- ” can help make sure the output is formatted as a list of bullet points.

See another coding focused example below.

Less effective ❌:

# Write a simple python function that
# 1. Asks me for a number in miles
# 2. It converts miles to kilometers

In the better example below, adding “import” hints to the model that it should start writing Python code. (Similarly, “SELECT” is a good hint for the start of a SQL statement.)

Better ✅:

# Write a simple python function that
# 1. Asks me for a number in miles
# 2. It converts miles to kilometers 

import
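
In code, priming simply means appending the cue to the end of the prompt and remembering to prepend it back onto the model’s completion, since the model continues right after the cue. A minimal sketch, reusing the example above; the helper below is our own illustration and is independent of any particular API client.

def prime(prompt: str, cue: str) -> str:
    """Append a leading cue so the model continues in the desired format."""
    return prompt.rstrip() + "\n\n" + cue

primed_prompt = prime(
    "# Write a simple python function that\n"
    "# 1. Asks me for a number in miles\n"
    "# 2. It converts miles to kilometers",
    cue="import",  # "import" nudges the model to answer with Python code
)
print(primed_prompt)

# After calling your LLM with primed_prompt, prepend the cue to the returned text:
# full_code = "import" + completion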

Use action words

In the prompt, be clear about what you want the model to do. In other words, instruct it directly, e.g. “Create X”, “Write Y”, etc.

Writing in past tense

This was observed by multiple people, e.g. https://twitter.com/stevemoraco/status/1651292545413169153?s=12&t=E0Rto2wK8TIXrfTXwbtbLg, but no supporting academic evidence was found, so this advice should be treated with caution. It also runs contrary to the best practice of using action words.

When writing a prompt, it may be useful to phrase it in the past tense, as if the answer has already been provided, i.e. instead of asking for something, write the prompt as though the model has already answered your question.

For example, instead of what we currently have in our prompts, i.e.

For a given query, output what statistical analysis method would allow analyst to answer that query.

we can write

For a given query you have provided, what statistical analysis method would allow an analyst to answer that query.

Words, grammar and semantic clarity are important

Even small changes in prompts (i.e. replacing one word with another) can have a big effect on the resulting performance.

For instance, “condense this” is more powerful than “rewrite this to be shorter”. Sometimes it may even be useful to use a thesaurus to find better words to express your intent.

It is also very important for the prompt to be semantically correct, especially in terms of domain knowledge. If the prompt is worded in a way that is typical for the domain you are operating in (i.e. in our case, our prompts sound like someone from the financial world wrote them), the output will be much better as well.

This also applies to the overall “quality” of the text. If the prompt contains mistakes, or is otherwise not correct English, the output of the model is likely to be poor as well (garbage in = garbage out).

Ask for structured output

LLMs are typically good at formatting their output in the way you need it. Therefore, it is preferable to explicitly ask for structured output when you need it.
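
A minimal sketch of what this can look like when a downstream step needs to consume the result: ask for JSON only and parse it. The prompt wording and the sample reply are our own illustration, not from any of the sources.

import json

prompt_template = """Extract the company names and people names from the text below.
Respond with JSON only, in the form {{"companies": [...], "people": [...]}}.

Text: \"\"\"
{text}
\"\"\""""

def parse_structured_reply(raw: str) -> dict:
    """Parse the model's reply, which we asked to be JSON only."""
    return json.loads(raw)  # raises ValueError if the model ignored the requested format

print(prompt_template.format(text="Stripe provides APIs for payment processing. OpenAI trains language models."))

# A reply we would expect back from the model for the text above:
sample_reply = '{"companies": ["Stripe", "OpenAI"], "people": []}'
print(parse_structured_reply(sample_reply))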

Repeat instructions at the end

Models can be susceptible to recency bias, which in this context means that information at the end of the prompt might have more significant influence over the output than information at the beginning of the prompt. Therefore, it is worth experimenting with repeating the instructions at the end of the prompt and evaluating the impact on the generated response.
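
As a concrete illustration (our own, reusing the summarization example from earlier in this guide), repeating the instruction can be as simple as including it twice in the prompt template:

instruction = "Summarize the text below as a bullet point list of the most important points."

prompt = f"""{instruction}

Text: \"\"\"
{{text input here}}
\"\"\"

Reminder: {instruction}"""

print(prompt)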

Few Shot Prompting

Use Few Shot prompting (providing several examples) to improve the results

Few Shot prompting ([2005.14165] Language Models are Few-Shot Learners) is the practice of providing several examples directly in the prompt. It has been shown to significantly improve the quality of the model’s output when very specific results are needed.

Example:

Extract keywords from the corresponding texts below.

Text 1: Stripe provides APIs that web developers can use to integrate payment processing into their websites and mobile applications.
Keywords 1: Stripe, payment processing, APIs, web developers, websites, mobile applications
##
Text 2: OpenAI has trained cutting-edge language models that are very good at understanding and generating text. Our API provides access to these models and can be used to solve virtually any task that involves processing language.
Keywords 2: OpenAI, language models, text processing, API.
##
Text 3: {text}
Keywords 3:

When Few Shot prompting still doesn’t provide good enough performance, it may be worth trying to fine-tune the model.

(from https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api, [2005.14165] Language Models are Few-Shot Learners )
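
A small sketch of how such a prompt can be assembled programmatically (our own illustration, reusing the keyword-extraction example above); keeping the demonstrations in a list makes it easy to add, remove or swap them later.

# (text, keywords) pairs used as the few-shot demonstrations
examples = [
    ("Stripe provides APIs that web developers can use to integrate payment "
     "processing into their websites and mobile applications.",
     "Stripe, payment processing, APIs, web developers, websites, mobile applications"),
    ("OpenAI has trained cutting-edge language models that are very good at "
     "understanding and generating text.",
     "OpenAI, language models, text processing, API"),
]

def build_few_shot_prompt(new_text: str) -> str:
    parts = ["Extract keywords from the corresponding texts below.\n"]
    for i, (text, keywords) in enumerate(examples, start=1):
        parts.append(f"Text {i}: {text}\nKeywords {i}: {keywords}\n##")
    parts.append(f"Text {len(examples) + 1}: {new_text}\nKeywords {len(examples) + 1}:")
    return "\n".join(parts)

print(build_few_shot_prompt("{text}"))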

Even random labels in the examples can still help with performance

Additional recommendations for Few Shot prompting from [2202.12837] Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? :

  • “the label space and the distribution of the input text specified by the demonstrations are both important (regardless of whether the labels are correct for individual inputs)”
  • the format you use also plays a key role in performance; even using random labels is much better than using no labels at all.
  • additional results show that selecting random labels from a true distribution of labels (instead of a uniform distribution) also helps.

(from https://www.promptingguide.ai/techniques/fewshot)

Limitations of Few Shot Prompting

For some tasks, Few Shot prompting is still not enough to achieve good performance. Tasks that involve more than a few reasoning steps typically fall into this group. In such cases, it can help to break the problem down into steps and demonstrate that to the model. More recently, Chain of Thought prompting has been popularized to address more complex arithmetic, commonsense, and symbolic reasoning tasks.

Chain of Thought prompting

Chain of Thought prompting: show examples together with intermediate reasoning steps

Introduced in [2201.11903] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, chain-of-thought (CoT) prompting enables complex reasoning capabilities through intermediate reasoning steps. You can combine it with few-shot prompting to get better results on more complex tasks that require reasoning before responding.

Example:

The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: Adding all the odd numbers (9, 15, 1) gives 25. The answer is False.
The odd numbers in this group add up to an even number: 17, 10, 19, 4, 8, 12, 24.
A: Adding all the odd numbers (17, 19) gives 36. The answer is True.
The odd numbers in this group add up to an even number: 16, 11, 14, 4, 8, 13, 24.
A: Adding all the odd numbers (11, 13) gives 24. The answer is True.
The odd numbers in this group add up to an even number: 17, 9, 10, 12, 13, 4, 2.
A: Adding all the odd numbers (17, 9, 13) gives 39. The answer is False.
The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
A:

Zero Shot Chain of Thought prompting: just ask for reasoning steps

Introduced in [2205.11916] Large Language Models are Zero-Shot Reasoners. With more advanced models (e.g. GPT-4), it is often enough to simply ask the model for its reasoning steps, e.g. by adding “Let’s think step by step” to the prompt.

Example:

I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples did I remain with?

Let’s think step by step.

Self Consistency, aka running the same prompt multiple times and choosing the best answer

Discussed in [2203.11171] Self-Consistency Improves Chain of Thought Reasoning in Language Models, the approach, in a nutshell, involves sampling multiple answers from the model (following multiple reasoning paths) and then picking the most consistent answer (what the authors describe as marginalizing out the reasoning paths).
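
A rough sketch of what this can look like in code, assuming the pre-1.0 OpenAI Python client and a chain-of-thought prompt whose answers end with a sentence of the form “The answer is …”; the model name and the naive answer extraction are our own assumptions, not from the paper.

import collections
import openai

def self_consistent_answer(prompt: str, samples: int = 5) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,  # temperature > 0 so the sampled reasoning paths actually differ
        n=samples,        # sample several completions in one call
    )
    answers = []
    for choice in response["choices"]:
        text = choice["message"]["content"]
        if "The answer is" in text:
            # naive extraction of the final answer from the reasoning
            answers.append(text.rsplit("The answer is", 1)[1].strip(" .\n"))
    # keep the answer that the largest number of reasoning paths agree on
    # (assumes at least one sample produced a parsable answer)
    return collections.Counter(answers).most_common(1)[0][0]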

Generated Knowledge Prompting

In a nutshell, this approach involves first generating “knowledge” related to the question, and then including that knowledge in the prompt to actually answer some question. Originally proposed in [2110.08387] Generated Knowledge Prompting for Commonsense Reasoning.

The approach is the following: instead of prompting directly for the actual answer, use few-shot prompting to ask the model to generate relevant knowledge (any relevant information).

Then, in the next prompt, integrate the knowledge generated in the previous step into the prompt as context.
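
A minimal sketch of the two stages, where generate stands for whatever function sends a prompt to your model and returns the completion text. The prompt wording is our own illustration; in practice the first stage would also include few-shot examples of question/knowledge pairs.

from typing import Callable

def answer_with_generated_knowledge(question: str, generate: Callable[[str], str]) -> str:
    # Stage 1: ask the model to produce relevant background knowledge.
    knowledge_prompt = (
        "Generate some knowledge that is relevant to answering the question below.\n\n"
        f"Question: {question}\nKnowledge:"
    )
    knowledge = generate(knowledge_prompt)

    # Stage 2: include the generated knowledge as context when asking the actual question.
    answer_prompt = (
        f"Question: {question}\n"
        f"Knowledge: {knowledge}\n"
        "Using the knowledge above, answer the question and explain your reasoning.\nAnswer:"
    )
    return generate(answer_prompt)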

Tree of Thoughts prompting

Introduced in [2305.10601] Tree of Thoughts: Deliberate Problem Solving with Large Language Models.

This is a complex approach that involves multiple prompts and multiple stages. In a nutshell, it consists of the following steps:

  1. Thought decomposition: take a task, and decompose it into a list of intermediary steps
  2. Thought generation: generate candidate thoughts (intermediate reasoning steps) for a given state.
  3. State evaluation: for different branches of the logic tree, evaluate how promising each one still is, so that branches that are no longer promising can be “pruned”
  4. Search: find the best possible branch of the logic tree to arrive at an answer.

On a high level, one can think of this approach as literally building a tree of thoughts, as opposed to the linear chain that Chain of Thought prompting uses.
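
For intuition, here is a very small beam-search sketch over thoughts. It is a simplification of the paper’s method: propose and evaluate stand in for prompts you would send to the model, and all names and defaults below are our own.

from typing import Callable, List

def tree_of_thoughts(
    task: str,
    propose: Callable[[str, str], List[str]],  # (task, partial solution) -> candidate next thoughts
    evaluate: Callable[[str, str], float],     # (task, partial solution) -> score, higher is better
    depth: int = 3,        # how many thought steps to take
    beam_width: int = 2,   # how many branches to keep after pruning
    branching: int = 3,    # how many thoughts to consider per branch
) -> str:
    frontier = [""]  # start from an empty partial solution
    for _ in range(depth):
        candidates = []
        for state in frontier:
            for thought in propose(task, state)[:branching]:  # thought generation
                candidates.append(state + thought + "\n")
        # state evaluation + pruning: keep only the most promising branches
        frontier = sorted(candidates, key=lambda s: evaluate(task, s), reverse=True)[:beam_width]
    return frontier[0] if frontier else ""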

Code-like prompting

Nothing will explain it better than the abstract of the paper that introduced it, [2304.13250] Exploring the Curious Case of Code Prompts:

Recent work has shown that prompting language models with code-like representations of natural language leads to performance improvements on structured reasoning tasks. However, such tasks comprise only a small subset of all natural language tasks. In our work, we seek to answer whether or not code prompting is the preferred way of interacting with language models in general. We compare code and text prompts across three popular GPT models (davinci, code-davinci-002, and text-davinci-002) on a broader selection of tasks (e.g., QA, sentiment, summarization) and find that with few exceptions, code prompts do not consistently outperform text prompts. Furthermore, we show that the style of code prompt has a large effect on performance for some but not all tasks and that finetuning on text instructions leads to better relative performance of code prompts.

It looks like, while code-like representations of prompts may be useful for some structured reasoning tasks, in general they may not be the best option.
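
For a rough idea of what a code prompt looks like compared to a plain text prompt, here is a small illustration of our own (not taken from the paper):

# The same question-answering task, written once as a plain text prompt and once as a
# code-style prompt that the model is expected to complete.
text_style_prompt = "Question: Where is the Eiffel Tower located?\nAnswer:"

code_style_prompt = '''# Answer the question.
question = "Where is the Eiffel Tower located?"
answer ='''

print(text_style_prompt)
print()
print(code_style_prompt)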

Building LLM products

What is temperature?

Temperature is a measure of how often the model outputs a less likely token. The higher the temperature, the more random (and usually more creative) the output. This, however, is not the same as “truthfulness”. For most factual use cases, such as data extraction and truthful Q&A, a temperature of 0 is best. (from https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api)
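
For illustration, a minimal sketch of pinning temperature to 0 with the OpenAI Python client (pre-1.0 style, current as of this guide); the model name and prompt are our own example.

import openai

# temperature=0 makes sampling (near-)deterministic: suitable for extraction and factual Q&A
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": "Extract the dates mentioned in: 'The invoice was issued on 2023-04-01 and paid on 2023-04-15.'",
    }],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])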

Schillace Laws of Semantic AI

These are not recommendations, but rather a high-level mental model to use when building systems based on LLMs.

Source is https://learn.microsoft.com/en-us/semantic-kernel/when-to-use-ai/schillace-laws

  1. Don’t write code if the model can do it; the model will get better, but the code won’t. The overall goal of the system is to build very high leverage programs using the LLM’s capacity to plan and understand intent. It’s very easy to slide back into a more imperative mode of thinking and write code for aspects of a program. Resist this temptation – to the degree that you can get the model to do something reliably now, it will be that much better and more robust as the model develops.
  2. Trade leverage for precision; use interaction to mitigate. Related to the above, the right mindset when coding with an LLM is not “let’s see what we can get the dancing bear to do,” it’s to get as much leverage from the system as possible. For example, it’s possible to build very general patterns, like “build a report from a database” or “teach a year of a subject” that can be parameterized with plain text prompts to produce enormously valuable and differentiated results easily.
  3. Code is for syntax and process; models are for semantics and intent. There are lots of different ways to say this, but fundamentally, the models are stronger when they are being asked to reason about meaning and goals, and weaker when they are being asked to perform specific calculations and processes. For example, it’s easy for advanced models to write code to solve a sudoku generally, but hard for them to solve a sudoku themselves. Each kind of code has different strengths and it’s important to use the right kind of code for the right kind of problem. The boundaries between syntax and semantics are the hard parts of these programs.
  4. The system will be as brittle as its most brittle part. This goes for either kind of code. Because we are striving for flexibility and high leverage, it’s important to not hard code anything unnecessarily. Put as much reasoning and flexibility into the prompts and use imperative code minimally to enable the LLM.
  5. Ask Smart to Get Smart. Emerging LLM AI models are incredibly capable and “well educated” but they lack context and initiative. If you ask them a simple or open-ended question, you will get a simple or generic answer back. If you want more detail and refinement, the question has to be more intelligent. This is an echo of “Garbage in, Garbage out” for the AI age.
  6. Uncertainty is an exception throw. Because we are trading precision for leverage, we need to lean on interaction with the user when the model is uncertain about intent. Thus, when we have a nested set of prompts in a program, and one of them is uncertain in its result (“One possible way…”) the correct thing to do is the equivalent of an “exception throw” – propagate that uncertainty up the stack until a level that can either clarify or interact with the user. (See the sketch after this list.)
  7. Text is the universal wire protocol. Since the LLMs are adept at parsing natural language and intent as well as semantics, text is a natural format for passing instructions between prompts, modules and LLM based services. Natural language is less precise for some uses, and it is possible to use structured language like XML sparingly, but generally speaking, passing natural language between prompts works very well, and is less fragile than more structured language for most uses. Over time, as these model-based programs proliferate, this is a natural “future proofing” that will make disparate prompts able to understand each other, the same way humans do.
  8. Hard for you is hard for the model. One common pattern when giving the model a challenging task is that it needs to “reason out loud.” This is fun to watch and very interesting, but it’s problematic when using a prompt as part of a program, where all that is needed is the result of the reasoning. However, using a “meta” prompt that is given the question and the verbose answer and asked to extract just the answer works quite well. This is a cognitive task that would be easier for a person (it’s easy to imagine being able to give someone the general task of “read this and pull out whatever the answer is” and have that work across many domains where the user had no expertise, just because natural language is so powerful). So, when writing programs, remember that something that would be hard for a person is likely to be hard for the model, and breaking patterns down into easier steps often gives a more stable result.
  9. Beware “pareidolia of consciousness”; the model can be used against itself. It is very easy to imagine a “mind” inside an LLM. But there are meaningful differences between human thinking and the model. An important one that can be exploited is that the models currently don’t remember interactions from one minute to the next. So, while we would never ask a human to look for bugs or malicious code in something they had just personally written, we can do that for the model. It might make the same kind of mistake in both places, but it’s not capable of “lying” to us because it doesn’t know where the code came from to begin with. This means we can “use the model against itself” in some places – it can be used as a safety monitor for code, a component of the testing strategy, a content filter on generated content, etc.
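
A minimal sketch of law 6 in code (our own illustration): treat an uncertain model answer like an exception and propagate it up to a level that can clarify with the user. Here generate stands for whatever function calls your model, ask_user for whatever surfaces a question to the user, and the hedge detection is a deliberately naive heuristic.

from typing import Callable, List

class UncertainAnswer(Exception):
    def __init__(self, answer: str):
        super().__init__(answer)
        self.answer = answer

HEDGES = ("one possible way", "it depends", "i am not sure", "it is unclear")

def run_prompt(prompt: str, generate: Callable[[str], str]) -> str:
    answer = generate(prompt)
    if any(hedge in answer.lower() for hedge in HEDGES):
        raise UncertainAnswer(answer)  # "uncertainty is an exception throw"
    return answer

def run_pipeline(prompts: List[str], generate: Callable[[str], str], ask_user: Callable[[str], str]):
    try:
        return [run_prompt(p, generate) for p in prompts]
    except UncertainAnswer as exc:
        # propagate the uncertainty up the stack to a level that can interact with the user
        return ask_user(f"The model was unsure: {exc.answer!r}. Can you clarify?")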
