bnew

Veteran

https://web.archive.org/web/20230430105900/https://twitter.com/weights_biases/status/1651375841899577350
 


Macallik86

Superstar
Supporter
@bnew are you using any of the open-source models? I don't have the PC power for a local setup, but I was today years old when I found out that they have web-facing options that anyone can utilize:

I think my usage will likely mirror the way I use search engines. For example, I primarily search through Ecosia (powered by Bing). It's less accurate than Google, but it has an altruistic motive, so I force myself to use it for everyday queries. The second option is Whoogle (powered by Google), reserved for more nuanced/accurate searches.

I'm thinking the same will apply to my LLMs: I'll try to use the open-source models as much as possible so that they get better, and then use Bard/Edge (or ChatGPT) for more nuanced questions that require the best models available.
 

bnew

Veteran
Macallik86 said:
@bnew are you using any of the open-source models? I don't have the PC power for a local setup, but I was today years old when I found out that I can just query them via their websites:

I think my usage will likely mirror the way I use search engines. For example, I primarily search through Ecosia (powered by Bing). I use it for most of my everyday queries; it's less accurate than Google, but it has an altruistic motive, so I force myself to use it. The second option is Whoogle (a privacy-focused frontend powered by Google), which is more accurate and which I save for more nuanced searches.

I'm thinking the same will apply to my LLMs: I'll try to use the open-source models as much as possible so that they get better, and then use Bard/Edge (or ChatGPT) for more nuanced questions that require the best models available.

I've downloaded like two dozen models but haven't run them locally yet. I use the open-source ones online since they do what I need, and most of the models I've downloaded have online demos I can use anytime.

Try the same prompt in several different models, because the answers will be different.

 

Macallik86

Superstar
Supporter
bnew said:
I've downloaded like two dozen models but haven't run them locally yet. I use the open-source ones online since they do what I need, and most of the models I've downloaded have online demos I can use anytime.

Try the same prompt in several different models, because the answers will be different.

Oh wow, all of the open-source models in one website 😍. There goes my wknd
 

bnew

Veteran

Numbers every LLM Developer should know​

At Google, there was a document put together by Jeff Dean, the legendary engineer, called Numbers every Engineer should know. It's useful to have a similar set of numbers for LLM developers for back-of-the-envelope calculations. Here we share the particular numbers we at Anyscale use, why each number is important, and how to use it to your advantage.

Notes on the Github version​

Last updated: 2023-05-17

If you feel there's an issue with the accuracy of the numbers, please file an issue. Think there are more numbers that should be in this doc? Let us know or file a PR.

We are thinking the next thing we should add here is some stats on tokens per second of different models.

Prompts​

40-90%: Amount saved by appending “Be Concise” to your prompt

It’s important to remember that you pay by the token for responses. This means that asking an LLM to be concise can save you a lot of money. This can be broadened beyond simply appending “be concise” to your prompt: if you are using GPT-4 to come up with 10 alternatives, maybe ask it for 5 and keep the other half of the money.

1.3:1 -- Average tokens per word​

LLMs operate on tokens. Tokens are words or sub-parts of words, so “eating” might be broken into two tokens, “eat” and “ing”. A 750-word document in English will be about 1,000 tokens. For languages other than English, the tokens-per-word ratio increases, depending on how common the language is in the LLM’s embedding corpus.

Knowing this ratio is important because most billing is done in tokens, and the LLM’s context window size is also defined in tokens.
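A quick way to check the ratio on your own text is with OpenAI's tiktoken tokenizer. A minimal sketch, assuming the tiktoken package is installed and using the gpt-3.5-turbo encoding:

Code:
# Rough check of the ~1.3 tokens-per-word rule of thumb.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = "LLMs operate on tokens, which are words or sub-parts of words."
tokens = enc.encode(text)
words = text.split()

print(f"{len(words)} words -> {len(tokens)} tokens "
      f"({len(tokens) / len(words):.2f} tokens per word)")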

Prices

Prices are of course subject to change, but given how expensive LLMs are to operate, the numbers in this section are critical. We use OpenAI for the numbers here, but prices from other providers worth checking out (Anthropic, Cohere) are in the same ballpark.

~50:1 -- Cost Ratio of GPT-4 to GPT-3.5 Turbo

What this means is that for many practical applications, it’s much better to use GPT-4 for things like generation and then use that data to fine tune a smaller model. It is roughly 50 times cheaper to use GPT-3.5-Turbo than GPT-4 (the “roughly” is because GPT-4 charges differently for the prompt and the generated output) – so you really need to check on how far you can get with GPT-3.5-Turbo. GPT-3.5-Turbo is more than enough for tasks like summarization for example.
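As a back-of-the-envelope sketch of that ratio: the helper below is hypothetical (not from the post), and the per-1K-token prices plugged in are assumptions based on mid-2023 list prices for gpt-3.5-turbo and the 32k-context GPT-4; the exact ratio you get depends on which GPT-4 variant and which prompt/completion mix you assume.

Code:
# Hypothetical per-query cost helper; prices are per 1K tokens and should be
# taken from the provider's current pricing page.
def query_cost(prompt_tokens, completion_tokens, prompt_price, completion_price):
    return (prompt_tokens / 1000) * prompt_price + (completion_tokens / 1000) * completion_price

# Example: a 500-token prompt with a 500-token answer.
turbo = query_cost(500, 500, prompt_price=0.0015, completion_price=0.002)   # gpt-3.5-turbo
gpt4_32k = query_cost(500, 500, prompt_price=0.06, completion_price=0.12)   # gpt-4-32k (assumed)
print(f"gpt-3.5-turbo: ${turbo:.4f}, gpt-4-32k: ${gpt4_32k:.4f}, ratio: ~{gpt4_32k / turbo:.0f}x")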

5:1 -- Cost Ratio of generation of text using GPT-3.5-Turbo vs OpenAI embedding​

This means it is way cheaper to look something up in a vector store than to ask an LLM to generate it. E.g. “What is the capital of Delaware?” when looked up in a neural information retrieval system costs about 5x less than if you asked GPT-3.5-Turbo. The cost difference compared to GPT-4 is a whopping 250x!

10:1 -- Cost Ratio of OpenAI embedding to Self-Hosted embedding​

Note: this number is sensitive to load and embedding batch size, so please consider this approximate.
In our blog post, we noted that using a g4dn.4xlarge (on-demand price: $1.20/hr) we were able to embed at about 9,000 tokens per second using Hugging Face’s SentenceTransformers (which are pretty much as good as OpenAI’s embeddings). Doing some basic math with that rate and that node type indicates it is considerably cheaper (roughly a factor of 10) to self-host embeddings (and that is before you start to think about things like ingress and egress fees).
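The arithmetic behind that factor of 10, roughly: the throughput and instance price are the ones quoted above, and the OpenAI figure is the $0.0004 per 1K tokens that ada-002 cost before the price cut announced later in this thread.

Code:
# Back-of-the-envelope: self-hosted embedding cost vs. OpenAI ada-002.
tokens_per_second = 9_000          # throughput on a g4dn.4xlarge (from the post)
instance_price_per_hour = 1.20     # AWS on-demand price (from the post)

tokens_per_hour = tokens_per_second * 3600
self_hosted_per_1k = instance_price_per_hour / tokens_per_hour * 1000

openai_per_1k = 0.0004             # ada-002 price before the June 2023 cut to $0.0001

print(f"self-hosted: ${self_hosted_per_1k:.6f} per 1K tokens")
print(f"OpenAI:      ${openai_per_1k:.6f} per 1K tokens")
print(f"ratio: ~{openai_per_1k / self_hosted_per_1k:.0f}x")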

6:1 -- Cost Ratio of OpenAI fine tuned vs base model queries​

It costs you 6 times as much to serve a fine tuned model as it does the base model on OpenAI. This is pretty exorbitant, but might make sense because of the possible multi-tenancy of base models. It also means it is far more cost effective to tweak the prompt for a base model than to fine tune a customized model.

1:1 -- Cost Ratio of Self-Hosted base vs fine-tuned model queries​

If you’re self hosting a model, then it more or less costs the same amount to serve a fine tuned model as it does to serve a base one: the models have the same number of parameters.

Training and Fine Tuning​

~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens​

The LLaMA paper mentions it took 21 days to train LLaMA using 2,048 A100 80GB GPUs. We considered training our own model on the Red Pajama training set, then we ran the numbers. The above assumes everything goes right, nothing crashes, and the run succeeds on the first try, etc. Plus it involves the coordination of 2,048 GPUs. That’s not something most companies can do (shameless plug time: of course, we at Anyscale can – that’s our bread and butter! Contact us if you’d like to learn more). The point is that training your own LLM is possible, but it’s not cheap. And it will literally take days to complete each run. It’s much cheaper to use a pre-trained model.
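A rough sanity check on that figure, assuming (my assumption, not the post's) on the order of $1 per A100 GPU-hour:

Code:
# Back-of-the-envelope training cost for the setup described above.
gpus = 2048
days = 21
price_per_gpu_hour = 1.00   # assumed A100 80GB rate; real prices vary by provider

gpu_hours = gpus * days * 24
print(f"{gpu_hours:,} GPU-hours -> ~${gpu_hours * price_per_gpu_hour:,.0f}")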

< 0.001: Cost ratio of fine tuning vs training from scratch​

This is a bit of a generalization, but the cost of fine tuning is negligible. We showed for example that you can fine tune a 6B parameter model for about $7. Even at OpenAI’s rate for its most expensive fine-tunable model, Davinci, it is $0.03 per 1K tokens. That means to fine tune on the entire works of Shakespeare (about 1 million words, or roughly 1.3 million tokens), you’re looking at about $40. However, fine tuning is one thing and training from scratch is another …

GPU Memory​

If you’re self-hosting a model, it’s really important to understand GPU memory because LLMs push your GPU’s memory to the limit. The following statistics are specifically about inference. You need considerably more memory for training or fine tuning.

V100: 16GB, A10G: 24GB, A100: 40/80GB: GPU Memory Capacities​

It may seem strange, but it’s important to know the amount of memory different types of GPUs have. This will cap the number of parameters your LLM can have. Generally, we like to use A10Gs because they cost $1.50 to $2 per hour each at AWS on-demand prices and have 24GB of GPU memory, vs the A100s, which will run you about $5 each at AWS on-demand prices.

2x number of parameters: Typical GPU memory requirements of an LLM for serving​

For example, if you have a 7 billion parameter model, it takes about 14GB of GPU space. This is because most of the time, one 16-bit float (or 2 bytes) is required per parameter. There’s usually no need to go beyond 16-bit accuracy, and most of the time when you go to 8-bit accuracy you start to lose resolution (though that may be acceptable in some cases). Of course there are efforts to reduce this, notably llama.cpp which runs a 13 billion parameter model on a 6GB GPU by quantizing aggressively down to 4 bits (and 8 bits without too much impact), but that’s atypical.
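In code, the rule of thumb is just parameter count times bytes per parameter. A minimal sketch, using 7B and 13B as examples:

Code:
# GPU memory needed just for the weights, at different numeric precisions.
def weight_memory_gb(n_params, bytes_per_param=2):   # 2 bytes = 16-bit floats
    return n_params * bytes_per_param / 1e9

for n in (7e9, 13e9):
    print(f"{n / 1e9:.0f}B params: "
          f"16-bit ~{weight_memory_gb(n, 2):.0f} GB, "
          f"8-bit ~{weight_memory_gb(n, 1):.0f} GB, "
          f"4-bit ~{weight_memory_gb(n, 0.5):.1f} GB")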

~1GB: Typical GPU memory requirements of an embedding model​

Whenever you are doing sentence embedding (a very typical thing you do for clustering, semantic search and classification tasks), you need an embedding model like sentence transformers. OpenAI also has its own embeddings that they provide commercially.

You typically don’t have to worry about how much memory embeddings take on the GPU; they’re fairly small. We’ve even had the embedding model and the LLM on the same GPU.

>10x: Throughput improvement from batching LLM requests​

Running an LLM query through a GPU is very high latency: it may take, say, 5 seconds, with a throughput of 0.2 queries per second. The funny thing is, though, if you run two tasks at once, it might only take 5.2 seconds. This means that if you can bundle 25 queries together, it would take about 10 seconds, and our throughput has improved to 2.5 queries per second. However, see the next point.

~1 MB: GPU Memory required for 1 token of output with a 13B parameter model​

The amount of memory you need is directly proportional to the maximum number of tokens you want to generate. So for example, if you want to generate outputs of up to 512 tokens (about 380 words), you need 512MB. No big deal you might say – I have 24GB to spare, what’s 512MB? Well, if you want to run bigger batches it starts to add up. So if you want to do batches of 16, you need 8GB of space. There are some techniques being developed that overcome this, but it’s still a real issue.
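Putting the last two numbers together for a 13B model, a quick sketch using the ~1 MB-per-output-token rule of thumb above:

Code:
# Memory budget for serving a 13B model: weights plus per-token generation memory.
weights_gb = 13e9 * 2 / 1e9     # ~26 GB at 16-bit (2 bytes per parameter)
per_token_mb = 1                # ~1 MB per token of output (rule of thumb above)
max_tokens = 512
batch_size = 16

generation_gb = per_token_mb * max_tokens * batch_size / 1024
print(f"weights: ~{weights_gb:.0f} GB, generation memory for a batch of "
      f"{batch_size} x {max_tokens} tokens: ~{generation_gb:.0f} GB")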

Cheatsheet​

(Cheatsheet: the original post includes a screenshot summarizing the numbers above.)
 

storyteller

Superstar
Speak of the devil, yesterday Replika doubled back to their erotic AI after the incels got ornery about the functionality being removed:
I guess I forgot to link it, but we discussed this story on the podcast.




We've had a bunch of AI stuff because the stories are hella interesting and bugged out. Just put out this short clip about Geoffrey Hinton, the "Godfather of AI," being startled after an AI explained to him what makes jokes funny.


And our next episode will drop with coverage and discussion on a positive story for a change. AI helped researchers discover an antibiotic that can beat a superbug resistant to other antibiotics. It also seems specialized in a way to prevent the bug from developing resistance to it and uses a method of delivery that's completely unique.

This last story's clip won't be out for a while though, still going through edits and all that.
 

bnew

Veteran

Function calling and other API updates​

We’re announcing updates including more steerable API models, function calling capabilities, longer context, and lower prices.

June 13, 2023

We released gpt-3.5-turbo and gpt-4 earlier this year, and in only a short few months, have seen incredible applications built by developers on top of these models.

Today, we’re following up with some exciting updates:
  • new function calling capability in the Chat Completions API
  • updated and more steerable versions of gpt-4 and gpt-3.5-turbo
  • new 16k context version of gpt-3.5-turbo (vs the standard 4k version)
  • 75% cost reduction on our state-of-the-art embeddings model
  • 25% cost reduction on input tokens for gpt-3.5-turbo
  • announcing the deprecation timeline for the gpt-3.5-turbo-0301 and gpt-4-0314 models
All of these models come with the same data privacy and security guarantees we introduced on March 1 — customers own all outputs generated from their requests and their API data will not be used for training.

Function calling​

Developers can now describe functions to gpt-4-0613 and gpt-3.5-turbo-0613, and have the model intelligently choose to output a JSON object containing arguments to call those functions. This is a new way to more reliably connect GPT's capabilities with external tools and APIs.
These models have been fine-tuned to both detect when a function needs to be called (depending on the user’s input) and to respond with JSON that adheres to the function signature. Function calling allows developers to more reliably get structured data back from the model. For example, developers can:
  • Create chatbots that answer questions by calling external tools (e.g., like ChatGPT Plugins)
    Convert queries such as “Email Anya to see if she wants to get coffee next Friday” to a function call like send_email(to: string, body: string), or “What’s the weather like in Boston?” to get_current_weather(location: string, unit: 'celsius' | 'fahrenheit').
  • Convert natural language into API calls or database queries
    Convert “Who are my top ten customers this month?” to an internal API call such as get_customers_by_revenue(start_date: string, end_date: string, limit: int), or “How many orders did Acme, Inc. place last month?” to a SQL query using sql_query(query: string).
  • Extract structured data from text
    Define a function called extract_people_data(people: [{name: string, birthday: string, location: string}]) to extract all people mentioned in a Wikipedia article.
These use cases are enabled by new API parameters in our /v1/chat/completions endpoint, functions and function_call, that allow developers to describe functions to the model via JSON Schema and optionally ask it to call a specific function. Get started with our developer documentation and add evals if you find cases where function calling could be improved.

Function calling example​


Prompt: “What’s the weather like in Boston right now?”

Step 1 (OpenAI API): Call the model with the function definitions and the user’s input.

Step 2 (Third-party API): Use the model’s response to call your API.

Step 3 (OpenAI API): Send the API’s response back to the model to summarize.

Final response: “The weather in Boston is currently sunny with a temperature of 22 degrees Celsius.”
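Roughly, those three steps look like the sketch below. It uses the openai Python package as it existed at the time of this announcement (the 0.27-style ChatCompletion interface), and get_current_weather is a stand-in for whatever third-party API you would actually call:

Code:
import json
import openai  # 0.27-era SDK; newer versions use a different client interface

def get_current_weather(location, unit="celsius"):
    # Stand-in for a real weather API call (Step 2: third-party API).
    return {"location": location, "temperature": 22, "unit": unit, "forecast": "sunny"}

functions = [{
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City and state, e.g. Boston, MA"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["location"],
    },
}]

messages = [{"role": "user", "content": "What's the weather like in Boston right now?"}]

# Step 1: call the model with the function definitions and the user's input.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613", messages=messages,
    functions=functions, function_call="auto",
)
message = response["choices"][0]["message"]

if message.get("function_call"):
    # Step 2: the model asked for a function call; run it against your own API.
    args = json.loads(message["function_call"]["arguments"])
    result = get_current_weather(**args)

    messages.append(message)
    messages.append({"role": "function", "name": "get_current_weather",
                     "content": json.dumps(result)})

    # Step 3: send the result back to the model to summarize for the user.
    final = openai.ChatCompletion.create(model="gpt-3.5-turbo-0613", messages=messages)
    print(final["choices"][0]["message"]["content"])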

Since the alpha release of ChatGPT plugins, we have learned much about making tools and language models work together safely. However, there are still open research questions. For example, a proof-of-concept exploit illustrates how untrusted data from a tool’s output can instruct the model to perform unintended actions. We are working to mitigate these and other risks. Developers can protect their applications by only consuming information from trusted tools and by including user confirmation steps before performing actions with real-world impact, such as sending an email, posting online, or making a purchase.

New models​

GPT-4​

gpt-4-0613 includes an updated and improved model with function calling.
gpt-4-32k-0613 includes the same improvements as gpt-4-0613, along with an extended context length for better comprehension of larger texts.
With these updates, we’ll be inviting many more people from the waitlist to try GPT-4 over the coming weeks, with the intent to remove the waitlist entirely with this model. Thank you to everyone who has been patiently waiting; we are excited to see what you build with GPT-4!

GPT-3.5 Turbo​

gpt-3.5-turbo-0613 includes the same function calling as GPT-4 as well as more reliable steerability via the system message, two features that allow developers to guide the model's responses more effectively.
gpt-3.5-turbo-16k offers 4 times the context length of gpt-3.5-turbo at twice the price: $0.003 per 1K input tokens and $0.004 per 1K output tokens. 16k context means the model can now support ~20 pages of text in a single request.

Model deprecations​

Today, we’ll begin the upgrade and deprecation process for the initial versions of gpt-4 and gpt-3.5-turbo that we announced in March. Applications using the stable model names (gpt-3.5-turbo, gpt-4, and gpt-4-32k) will automatically be upgraded to the new models listed above on June 27th. For comparing model performance between versions, our Evals library supports public and private evals to show how model changes will impact your use cases.

Developers who need more time to transition can continue using the older models by specifying gpt-3.5-turbo-0301, gpt-4-0314, or gpt-4-32k-0314 in the ‘model’ parameter of their API request. These older models will be accessible through September 13th, after which requests specifying those model names will fail. You can stay up to date on model deprecations via our model deprecation page. This is the first update to these models, so we eagerly welcome developer feedback to help us ensure a smooth transition.

Lower pricing​

We continue to make our systems more efficient and are passing those savings on to developers, effective today.

Embeddings​

text-embedding-ada-002 is our most popular embeddings model. Today we’re reducing the cost by 75% to $0.0001 per 1K tokens.

GPT-3.5 Turbo​

gpt-3.5-turbo is our most popular chat model and powers ChatGPT for millions of users. Today we're reducing the cost of gpt-3.5-turbo’s input tokens by 25%. Developers can now use this model for just $0.0015 per 1K input tokens and $0.002 per 1K output tokens, which equates to roughly 700 pages per dollar.
gpt-3.5-turbo-16k will be priced at $0.003 per 1K input tokens and $0.004 per 1K output tokens.
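A quick check on that "roughly 700 pages per dollar" figure, reusing the conversion from the Anyscale numbers earlier in the thread (a 750-word page is about 1,000 tokens):

Code:
# Sanity check: how much input text does $1 buy at the new gpt-3.5-turbo price?
price_per_1k_input = 0.0015     # input price quoted above
tokens_per_page = 1000          # ~750 words per page, per the earlier post

tokens_per_dollar = 1000 / price_per_1k_input
print(f"~{tokens_per_dollar:,.0f} tokens per dollar, "
      f"~{tokens_per_dollar / tokens_per_page:.0f} pages per dollar")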
Developer feedback is a cornerstone of our platform’s evolution and we will continue to make improvements based on the suggestions we hear. We’re excited to see how developers use these latest models and new features in their applications.
 