bnew · Veteran · Joined Nov 1, 2015 · Messages: 51,072 · Reputation: 7,866 · Daps: 147,315

Chinese Nobel laureate Mo Yan shocks audience after revealing he used ChatGPT to write speech​

‘I was supposed to write a commendation for [Yu Hua] as per tradition, but I struggled for several days,’ novelist said​

Peony Hirwani
1 day ago

Chinese Nobel laureate Mo Yan revealed he used ChatGPT to write a speech to praise fellow author Yu Hua.

This week, the 68-year-old novelist presented a book award to Yu at the Shanghai Dance Centre during the 65th-anniversary celebration of Shouhuo magazine.


“The person who is receiving this award is truly remarkable and, of course, he is also my good friend. He is extraordinary, so I must be too,” Mo said during his speech.


“A few days ago, I was supposed to write a commendation for him as per tradition, but I struggled for several days and couldn’t come up with anything. So I asked a doctoral student to help me by using ChatGPT.”

According to South China Morning Post, there was an “audible gasp” from the audience when they found out that the Nobel Prize winner crafted his speech using artificial intelligence.

The Independent has attempted to reach Mo’s representatives for comment.

Mo is a Chinese novelist and short story writer.

In 2012, he was awarded the Nobel Prize in Literature for his work as a writer “who with hallucinatory realism merges folk tales, history and the contemporary”.

(Photo: AFP via Getty Images)

He is best known to global readers for his 1986 novel Red Sorghum, the first two parts of which were adapted into the Golden Bear-winning film of the same name.

The author won the 2005 International Nonino Prize in Italy. In 2009, he was the first recipient of the University of Oklahoma’s Newman Prize for Chinese Literature.

So far, Mo has written 11 novels, and several novellas and short story collections.

A video of Mo’s speech at the Shanghai Dance Centre has gone viral on the Chinese microblogging website Weibo.

In the comments section, some people pointed out that Mo could face legal trouble for mentioning ChatGPT as the service has not yet been made available in China.
 

bnew

About​

Consistent Image Synthesis and Editing

ljzycmd.github.io/projects/MasaCtrl/

MasaCtrl: Tuning-free Mutual Self-Attention Control for Consistent Image Synthesis and Editing​

PyTorch implementation of MasaCtrl: Tuning-free Mutual Self-Attention Control for Consistent Image Synthesis and Editing

Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, Yinqiang Zheng

arXiv | Project page | Demos


MasaCtrl enables a wide range of consistent, non-rigid image synthesis and editing without fine-tuning or optimization.

Updates​

  • [2023/5/13] The inference code of MasaCtrl with T2I-Adapter is available.
  • [2023/4/28] Hugging Face demo released.
  • [2023/4/25] Code released.
  • [2023/4/17] Paper is available here.

Introduction​

We propose MasaCtrl, a tuning-free method for consistent non-rigid image synthesis and editing. The key idea is to combine the content of the source image with the layout synthesized from the text prompt and additional controls into the desired synthesized or edited image, via mutual self-attention control.
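To make the mechanism concrete, here is a minimal conceptual sketch of mutual self-attention in PyTorch: the editing branch keeps its own queries but attends to the source branch's keys and values, so content is drawn from the source image while the layout follows the target prompt. This is only an illustration, not the repository's actual hook into Stable Diffusion's attention processors; the shapes, head count, and the schedule of layers and denoising steps where the control is applied are simplified assumptions.

```python
import torch

def mutual_self_attention(q_target, k_source, v_source, num_heads=8):
    """Conceptual sketch of mutual self-attention control (not the repo's API):
    the target branch keeps its own queries but uses the source branch's
    keys/values, pulling content from the source while the layout follows
    the target prompt. Inputs are (batch, tokens, dim) tensors."""
    b, n, d = q_target.shape
    head_dim = d // num_heads

    def split(x):  # (b, tokens, d) -> (b, heads, tokens, head_dim)
        return x.view(b, -1, num_heads, head_dim).transpose(1, 2)

    q, k, v = split(q_target), split(k_source), split(v_source)
    attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1)
    out = attn @ v
    return out.transpose(1, 2).reshape(b, n, d)
```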

Main Features​

1 Consistent Image Synthesis and Editing​

MasaCtrl can perform prompt-based image synthesis and editing that changes the layout while maintaining the contents of the source image.

The target layout is synthesized directly from the target prompt.
Figures: consistent synthesis results and real-image editing results.

2 Integration to Controllable Diffusion Models​

Directly modifying the text prompt often fails to produce the target layout of the desired image, so we further integrate our method into existing controllable diffusion pipelines (such as T2I-Adapter and ControlNet) to obtain stable synthesis and editing results.

The target layout is controlled by additional guidance.
Figure: synthesis (left) and editing (right) results with T2I-Adapter.

3 Generalization to Other Models: Anything-V4​

Our method also generalizes well to other Stable-Diffusion-based models.

Results on Anything-V4
 

bnew




💥 CoAdapter: Huggingface CoAdapter | T2I-Adapter: Huggingface T2I-Adapter
🎨Demos | ⏬Download Models | 💻How to Test | 🏰Adapter Zoo
Official implementation of T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models.


🚩 New Features/Updates


🔥🔥🔥 Support CoAdapter (Composable Adapter).
You can find details and demos for CoAdapter in coadapter.md


Introduction​


We propose T2I-Adapter, a simple and small (~70M parameters, ~300M storage space) network that can provide extra guidance to pre-trained text-to-image models while freezing the original large text-to-image models.

T2I-Adapter aligns internal knowledge in T2I models with external control signals. We can train various adapters according to different conditions, and achieve rich control and editing effects.


⏬ Download Models​

Put the downloaded models in the T2I-Adapter/models folder.

  1. You can find the pretrained T2I-Adapters, CoAdapters, and third-party models at TencentARC/T2I-Adapter · Hugging Face (see the download sketch after this list).
  2. A base SD model is still needed for inference. We recommend Stable Diffusion v1.5, but note that the adapters should also work well on other SD models fine-tuned from SD v1.4 or v1.5. You can download these models from Hugging Face or Civitai; all of the tested models below (e.g., the Anything anime model) can be found there.
  3. [Optional] If you want to use the mmpose adapter, you need to download the pretrained keypose detection models, including Faster R-CNN (human detection) and HRNet (pose detection).
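For step 1, here is a minimal sketch of pulling the pretrained adapters into the expected folder with `huggingface_hub`. This is an assumption about workflow, not the repo's official download script; check the model card for the exact files you need.

```python
# Minimal sketch: download the pretrained adapters into T2I-Adapter/models.
# Assumes `pip install huggingface_hub`; not the repo's official script.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="TencentARC/T2I-Adapter",  # pretrained T2I-Adapters / CoAdapters
    local_dir="T2I-Adapter/models",    # folder the README expects
)
```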
 

bnew




A responsible path to generative AI in healthcare​

April 13, 2023


Aashima Gupta​

Global Director of Healthcare Strategy & Solutions, Google Cloud

Amy Waldron​

Global Director of Health Plan Strategy & Solutions, Google Cloud


Healthcare breakthroughs change the world and bring hope to humanity through scientific rigor, human insight, and compassion. We believe AI can contribute to this, with thoughtful collaboration between researchers, healthcare organizations and the broader ecosystem.

Today, we're sharing exciting progress on these initiatives, with the announcement of limited access to Google’s medical large language model, or LLM, called Med-PaLM 2. It will be available in coming weeks to a select group of Google Cloud customers for limited testing, to explore use cases and share feedback as we investigate safe, responsible, and meaningful ways to use this technology.

Med-PaLM 2 harnesses the power of Google’s LLMs, aligned to the medical domain to answer medical questions more accurately and safely. As a result, Med-PaLM 2 was the first LLM to perform at an “expert” test-taker level on the MedQA dataset of US Medical Licensing Examination (USMLE)-style questions, reaching more than 85% accuracy, and it was the first AI system to reach a passing score on the MedMCQA dataset of Indian AIIMS and NEET medical examination questions, scoring 72.3%.

Industry-tailored LLMs like Med-PaLM 2 are part of a burgeoning family of generative AI technologies that have the potential to significantly enhance healthcare experiences. We’re looking forward to working with our customers to understand how Med-PaLM 2 might be used to facilitate rich, informative discussions, answer complex medical questions, and find insights in complicated and unstructured medical texts. They might also explore its utility to help draft short- and long-form responses and summarize documentation and insights from internal data sets and bodies of scientific knowledge.

Innovating responsibly with AI​

Since last year, we’ve been researching and evaluating Med-PaLM and Med-PaLM 2, assessing them against multiple criteria (including scientific consensus, medical reasoning, knowledge recall, bias, and likelihood of possible harm), with evaluations by clinicians and non-clinicians from a range of backgrounds and countries.

Med-PaLM 2's impressive performance on medical exam-style questions is a promising development, but we need to learn how this can be harnessed to benefit healthcare workers, researchers, administrators, and patients. In building Med-PaLM 2, we’ve been focused on safety, equity, and evaluations of unfair bias. Our limited access for select Google Cloud customers will be an important step in furthering these efforts, bringing in additional expertise across the healthcare and life sciences ecosystem.

What’s more, when Google Cloud brings new AI advances to our products, our commitment is two-fold: to not only deliver transformative capabilities, but also ensure our technologies include proper protections for our organizations, their users, and society. To this end, our AI Principles, established in 2017, form a living constitution that guides our approach to building advanced technologies, conducting research, and drafting our product development policies.

From AI to generative AI​

Google's deep history in AI informs our work in generative AI technologies, which can find complex relationships in large sets of training data, then generalize from what they learn to create new data. Breakthroughs such as the Transformer have enabled LLMs and other large models to scale to billions of parameters, letting generative AI move beyond the limited pattern-spotting of earlier AIs and into the creation of novel expressions of content, from speech to scientific modeling.

Google Cloud is committed to bringing to market products that are informed by our research efforts across Alphabet. In 2022, we introduced a deep integration between Google Cloud and Alphabet's AI research organizations, which allows Vertex AI to run DeepMind's groundbreaking protein structure prediction system, AlphaFold.

Much more is on the way. In one sense, generative AI is revolutionary. In another, it's the familiar technology story of more and better computing creating new industries, from desktop publishing to the internet, social networks, mobile apps, and now, generative AI.

Building on AI leadership​

Additionally, today we’re announcing a new AI-enabled Claims Acceleration Suite, designed to streamline health insurance prior authorization and claims processing. The Claims Acceleration Suite helps both health plans and healthcare providers create operational efficiencies and reduce administrative burdens and costs by converting unstructured data into structured data that helps experts make faster decisions and improves access to timely patient care.

On the clinical side, last year we announced Medical Imaging Suite, an AI-assisted diagnosis technology being used by Hologic to improve cervical cancer diagnoses and Hackensack Meridian Health to predict metastasis in patients with prostate cancer. Elsewhere, Mayo Clinic and Google have collaborated on an AI algorithm to improve the care of head and neck cancers, and Google Health recently partnered with iCAD to improve breast cancer screening with AI.

From these examples and more, it's clear that the healthcare industry has moved from testing AI to deploying it to improve workflows, solve business problems, and speed healing. With this in mind, we expect rapid interest in and uptake of generative AI technologies. Healthcare organizations are eager to learn about generative AI and how they can use it to make a real difference.

Looking ahead​

The power of AI has reinforced Google Cloud's commitment to privacy, security, and transparency. Our platforms are designed to be flexible, including data and model lineage capabilities, integrated security and identity management services, support for third-party models, choice and transparency on models and costs, integrated billing and entitlement support, and support across many languages.

While we’ll have some innovations like Med-PaLM 2 that are tuned for healthcare, we also have products that are relevant across industries. Last month, we announced several generative AI capabilities coming to Google Cloud, including Generative AI support in Vertex AI and Generative AI App Builder, which are already being tested by a number of customers. Developers and businesses already use Vertex AI to build and deploy machine learning models and AI applications at scale, and we recently added Generative AI support in Vertex AI. This gives customers foundation models they can fine-tune with their own data, and the ability to deploy applications with this powerful new technology. We also launched Generative AI App Builder to help organizations build their own AI-powered chat interfaces and digital assistants in minutes or hours by connecting conversational AI flows with out-of-the-box search experiences and foundation models.
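As an illustration of what that looks like in practice, here is a minimal sketch using the 2023 preview of the Vertex AI Python SDK; the project ID, prompt, model name, and parameters are placeholders, and current SDK surfaces may differ.

```python
# Minimal sketch of calling a Vertex AI foundation model (2023 preview SDK).
# Project ID and prompt are placeholders; verify model names and quotas first.
import vertexai
from vertexai.preview.language_models import TextGenerationModel

vertexai.init(project="my-gcp-project", location="us-central1")
model = TextGenerationModel.from_pretrained("text-bison@001")

response = model.predict(
    "Summarize the prior-authorization requirements described below: ...",
    temperature=0.2,
    max_output_tokens=256,
)
print(response.text)
```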

As AI proves its value, there will likely be an increased focus on high-quality data collection and curation in healthcare and life sciences. Improving the flow and unification of data across healthcare systems, referred to as data interoperability, is one of the most important building blocks for leveraging AI: it helps organizations run more effectively, improves patient care, and helps people live healthier lives. We expect to continue our investments in technology, infrastructure, and data governance.

We're committed to realizing the potential of this technology in healthcare. By working with a handful of trusted healthcare organizations early on, we’ll learn more about what can be achieved, and how this technology can safely advance. For all of us, the prospects are inspiring, humbling, and exciting.

If you’re interested in exploring generative AI on Google Cloud, you can sign up for our Trusted Tester program or reach out to your Google Cloud sales representative.
 

Majestyx · Duck Season · Joined May 2, 2012 · Messages: 16,846 · Reputation: 2,444 · Daps: 38,741 · Reppin: Los Scandalous
my goodness! it opened its mouth. :mindblown:


edit:

whoa


(DragGAN demo GIF)


project page:
me and some brehs that work in animation were talking about this, brehs is looking for new jobs :francis:
 

bnew

About​

Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model

Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model​

Foundation models have made significant strides in various applications, including text-to-image generation, panoptic segmentation, and natural language processing. This paper presents Instruct2Act, a framework that utilizes Large Language Models to map multi-modal instructions to sequential actions for robotic manipulation tasks. Specifically, Instruct2Act employs the LLM model to generate Python programs that constitute a comprehensive perception, planning, and action loop for robotic tasks. In the perception section, pre-defined APIs are used to access multiple foundation models where the Segment Anything Model (SAM) accurately locates candidate objects, and CLIP classifies them. In this way, the framework leverages the expertise of foundation models and robotic abilities to convert complex high-level instructions into precise policy codes. Our approach is adjustable and flexible in accommodating various instruction modalities and input types and catering to specific task demands. We validated the practicality and efficiency of our approach by assessing it on robotic tasks in different scenarios within tabletop manipulation domains. Furthermore, our zero-shot method outperformed many state-of-the-art learning-based policies in several tasks.
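To give a feel for the idea, here is a hypothetical sketch of the kind of perception-plan-act program the LLM might emit; the API names (SAM, CLIP_classify, mask_to_pose, pick_and_place) are illustrative stand-ins for the framework's pre-defined APIs, not its actual function signatures.

```python
# Hypothetical example of an LLM-generated policy in the Instruct2Act style.
# All api.* calls are illustrative assumptions, not the framework's real API.
def put_the_red_block_into_the_bowl(image, api):
    # Perception: SAM proposes candidate masks; CLIP labels them.
    masks = api.SAM(image)
    red_block = api.CLIP_classify(image, masks, "red block")
    bowl = api.CLIP_classify(image, masks, "bowl")

    # Planning: convert the selected masks into pick/place poses.
    pick_pose = api.mask_to_pose(red_block)
    place_pose = api.mask_to_pose(bowl)

    # Action: execute the manipulation primitive on the robot.
    api.pick_and_place(pick_pose, place_pose)
```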




google bard summary:

  • Researchers created a new way to make robots follow instructions.
  • The new way uses a large language model to translate instructions into code that the robot can understand.
  • The new way is more flexible and efficient than previous methods.
  • The new way was tested on a variety of tasks and it worked well.
Here's a more detailed explanation of how the new method works:

The new method uses a large language model (LLM) to translate instructions into code. The LLM is a computer program that has been trained on a massive dataset of text and code. This training allows the LLM to understand the meaning of instructions and to generate code that can be executed by a robot.

The instructions that the LLM can translate are called multi-modal instructions. Multi-modal instructions are instructions that include information from multiple sources. For example, a multi-modal instruction might include a picture of an object, the name of the object, and a description of what to do with the object.

The LLM uses the information from the multi-modal instruction to generate code that tells the robot what to do. The code that the LLM generates is specific to the task that the robot is supposed to perform. For example, if the robot is supposed to pick up a cup, the LLM will generate code that tells the robot how to move its arm and its hand to pick up the cup.

The new method is more flexible and efficient than previous methods for making robots follow instructions. Previous methods were limited to translating simple instructions. The new method can translate complex instructions, including instructions that include information from multiple sources. The new method is also more efficient than previous methods. It can translate instructions much faster than previous methods.

The new method was tested on a variety of tasks and it worked well. The method was able to successfully translate instructions and make robots follow those instructions. The method was also able to generalize to new tasks. This means that the method can be used to make robots follow instructions for tasks that it has never seen before.

The new method is a significant improvement over previous methods for making robots follow instructions. It is more flexible, efficient, and generalizable than previous methods. The new method has the potential to make robots more useful and more accessible.
 

bnew


Pretty sure Bard is worse than Vicuna-13B. Vicuna-13B is almost as good as 3.5-turbo for most tasks except coding (verified by Microsoft as well: https://github.com/microsoft/guidance/blob/8677f3aa269e05ecbb942585560a44db51d507ca/notebooks/chatgpt_vs_open_source_on_harder_tasks.ipynb…). Not sure what the plan is at Goog HQ. GPT-4 is leaps and bounds better than every model rn.


Mother of all LLM benchmarks!
  • Use GPT-4 if you need best quality
  • Use claude-instant-v1 for everything else
  • Google PaLM2 is nowhere near OpenAI/Anthropic
  • OpenAI models are painfully slow compared to competition
  • Open-source models next
Source: https://github.com/kagisearch/pyllms




 

bnew

Numbers every LLM Developer should know​

At Google, the legendary engineer Jeff Dean put together a document called Numbers Every Engineer Should Know. It’s really useful to have a similar set of numbers for LLM developers for back-of-the-envelope calculations. Here we share the particular numbers we at Anyscale use, why each number is important, and how to use it to your advantage.

Notes on the Github version​

Last updated: 2023-05-17

If you feel there's an issue with the accuracy of the numbers, please file an issue. Think there are more numbers that should be in this doc? Let us know or file a PR.

We think the next thing to add here is some statistics on tokens per second for different models.

Prompts​

40-90%: Amount saved by appending “Be Concise” to your prompt​

It’s important to remember that you pay by the token for responses. This means that asking an LLM to be concise can save you a lot of money. This can be broadened beyond simply appending “be concise” to your prompt: if you are using GPT-4 to come up with 10 alternatives, maybe ask it for 5 and keep the other half of the money.

1.3:1 -- Average tokens per word​

LLMs operate on tokens. Tokens are words or sub-parts of words, so “eating” might be broken into two tokens, “eat” and “ing”. A 750-word document in English will be about 1,000 tokens. For languages other than English, the number of tokens per word increases, depending on how common the language is in the LLM’s embedding corpus.

Knowing this ratio is important because most billing is done in tokens, and the LLM’s context window size is also defined in tokens.
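As a rough illustration of how to use this ratio (the constant below is just the 1.3 average quoted above):

```python
TOKENS_PER_WORD = 1.3  # rough English average quoted above

def estimate_tokens(word_count: int) -> int:
    """Back-of-the-envelope token estimate for billing / context-window checks."""
    return round(word_count * TOKENS_PER_WORD)

print(estimate_tokens(750))  # ~975, consistent with "750 words is about 1000 tokens"
```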

Prices​

Prices are of course subject to change, but given how expensive LLMs are to operate, the numbers in this section are critical. We use OpenAI for the numbers here, but prices from other providers worth checking out (Anthropic, Cohere) are in the same ballpark.

~50:1 -- Cost Ratio of GPT-4 to GPT-3.5 Turbo​

What this means is that for many practical applications, it’s much better to use GPT-4 for things like generation and then use that data to fine tune a smaller model. It is roughly 50 times cheaper to use GPT-3.5-Turbo than GPT-4 (the “roughly” is because GPT-4 charges differently for the prompt and the generated output) – so you really need to check on how far you can get with GPT-3.5-Turbo. GPT-3.5-Turbo is more than enough for tasks like summarization for example.

5:1 -- Cost Ratio of generation of text using GPT-3.5-Turbo vs OpenAI embedding​

This means it is way cheaper to look something up in a vector store than to ask an LLM to generate it. E.g. “What is the capital of Delaware?” when looked up in a neural information retrieval system costs about 5x less than if you asked GPT-3.5-Turbo. The cost difference compared to GPT-4 is a whopping 250x!

10:1 -- Cost Ratio of OpenAI embedding to Self-Hosted embedding​

Note: this number is sensitive to load and embedding batch size, so please consider this approximate.
In our blog post, we noted that using a g4dn.4xlarge (on-demand price: $1.20/hr) we were able to embed at about 9,000 tokens per second using Hugging Face’s SentenceTransformers (which are pretty much as good as OpenAI’s embeddings). Doing some basic math with that rate and that node type indicates it is considerably cheaper (by a factor of 10) to self-host embeddings (and that is before you start to think about things like ingress and egress fees).
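A back-of-the-envelope version of that math, with the OpenAI embedding price treated as an assumption (roughly ada-002's 2023 list price of $0.0004 per 1K tokens; check current pricing):

```python
# Self-hosted throughput and instance price quoted above.
tokens_per_sec = 9_000
gpu_price_per_hr = 1.20  # g4dn.4xlarge on-demand

self_hosted_per_million = gpu_price_per_hr / (tokens_per_sec * 3600) * 1_000_000
openai_per_million = 0.40  # assumed: $0.0004 per 1K tokens (ada-002, 2023)

print(f"self-hosted: ${self_hosted_per_million:.3f} per 1M tokens")  # ~$0.037
print(f"OpenAI:      ${openai_per_million:.2f} per 1M tokens")
print(f"ratio:       ~{openai_per_million / self_hosted_per_million:.0f}x")  # ~11x
```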

6:1 -- Cost Ratio of OpenAI fine tuned vs base model queries​

It costs you 6 times as much to serve a fine tuned model as it does the base model on OpenAI. This is pretty exorbitant, but might make sense because of the possible multi-tenancy of base models. It also means it is far more cost effective to tweak the prompt for a base model than to fine tune a customized model.

1:1 -- Cost Ratio of Self-Hosted base vs fine-tuned model queries​

If you’re self hosting a model, then it more or less costs the same amount to serve a fine tuned model as it does to serve a base one: the models have the same number of parameters.

Training and Fine Tuning​

~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens​

The LLaMa paper mentions it took 21 days to train LLaMa using 2,048 A100 80GB GPUs. We considered training our own model on the Red Pajama training set, then we ran the numbers. The above assumes everything goes right: nothing crashes, the calculation succeeds on the first try, and so on. Plus it involves coordinating 2,048 GPUs. That’s not something most companies can do (shameless plug time: of course, we at Anyscale can – that’s our bread and butter! Contact us if you’d like to learn more). The point is that training your own LLM is possible, but it’s not cheap, and it will literally take days to complete each run. It’s much cheaper to use a pre-trained model.
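A rough sanity check on that figure (the per-GPU-hour rate is an assumption; discounted A100 80GB capacity in 2023 was on the order of $1 per GPU-hour):

```python
gpus = 2048
days = 21
assumed_price_per_gpu_hour = 1.00  # assumption: discounted A100 80GB rate

gpu_hours = gpus * days * 24                  # ~1.03 million GPU-hours
cost = gpu_hours * assumed_price_per_gpu_hour
print(f"{gpu_hours:,} GPU-hours -> ~${cost / 1e6:.1f}M")  # ~$1.0M
```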

< 0.001: Cost ratio of fine tuning vs training from scratch​

This is a bit of a generalization, but the cost of fine tuning is negligible. We showed, for example, that you can fine tune a 6B parameter model for about $7. Even at OpenAI’s rate for its most expensive fine-tunable model, Davinci, it is 3 cents per 1,000 tokens. That means to fine tune on the entire works of Shakespeare (about 1 million words), you’re looking at roughly $40. However, fine tuning is one thing and training from scratch is another …

GPU Memory​

If you’re self-hosting a model, it’s really important to understand GPU memory because LLMs push your GPU’s memory to the limit. The following statistics are specifically about inference. You need considerably more memory for training or fine tuning.

V100: 16GB, A10G: 24GB, A100: 40/80GB: GPU Memory Capacities​

It may seem strange, but it’s important to know how much memory different types of GPUs have, since this caps the number of parameters your LLM can have. Generally, we like to use A10Gs because they cost $1.50 to $2 per hour each at AWS on-demand prices and have 24GB of GPU memory, vs the A100s, which will run you about $5 each at AWS on-demand prices.

2x number of parameters: Typical GPU memory requirements of an LLM for serving​

For example, if you have a 7 billion parameter model, it takes about 14GB of GPU space. This is because most of the time, one 16-bit float (or 2 bytes) is required per parameter. There’s usually no need to go beyond 16-bit accuracy, and most of the time when you go to 8-bit accuracy you start to lose resolution (though that may be acceptable in some cases). Of course there are efforts to reduce this, notably llama.cpp which runs a 13 billion parameter model on a 6GB GPU by quantizing aggressively down to 4 bits (and 8 bits without too much impact), but that’s atypical.
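For a quick estimate (assuming 16-bit weights and ignoring activation and KV-cache overhead):

```python
def fp16_weight_memory_gb(num_params: float) -> float:
    """Approximate GPU memory for model weights at 2 bytes per parameter."""
    return num_params * 2 / 1e9

for params in (7e9, 13e9, 70e9):
    print(f"{params / 1e9:.0f}B params -> ~{fp16_weight_memory_gb(params):.0f} GB")
# 7B -> ~14 GB, 13B -> ~26 GB, 70B -> ~140 GB
```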

~1GB: Typical GPU memory requirements of an embedding model​

Whenever you are doing sentence embedding (a very typical thing you do for clustering, semantic search and classification tasks), you need an embedding model like sentence transformers. OpenAI also has its own embeddings that they provide commercially.

You typically don’t have to worry about how much memory embeddings take on the GPU, they’re fairly small. We’ve even had the embedding and the LLM on the same GPU.

>10x: Throughput improvement from batching LLM requests​

Running an LLM query through a GPU is very high latency: it may take, say, 5 seconds, with a throughput of 0.2 queries per second. The funny thing is, though, if you run two tasks, it might only take 5.2 seconds. This means that if you can bundle 25 queries together, it would take about 10 seconds, and our throughput has improved to 2.5 queries per second. However, see the next point.
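In code, that back-of-the-envelope looks like the following; the 5-second base latency and ~0.2-second marginal cost per extra batched query are the illustrative numbers from the paragraph above:

```python
base_latency_s = 5.0        # single-query latency from the example above
marginal_s_per_query = 0.2  # illustrative marginal cost per extra query in a batch

def throughput_qps(batch_size: int) -> float:
    batch_latency = base_latency_s + (batch_size - 1) * marginal_s_per_query
    return batch_size / batch_latency

for b in (1, 2, 25):
    print(f"batch={b:2d}: {throughput_qps(b):.1f} queries/sec")
# batch=1: 0.2, batch=2: 0.4, batch=25: ~2.6 -> the >10x improvement above
```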

~1 MB: GPU Memory required for 1 token of output with a 13B parameter model​

The amount of memory you need is directly proportional to the maximum number of tokens you want to generate. So for example, if you want to generate outputs of up to 512 tokens (about 380 words), you need 512MB. No big deal you might say – I have 24GB to spare, what’s 512MB? Well, if you want to run bigger batches it starts to add up. So if you want to do batches of 16, you need 8GB of space. There are some techniques being developed that overcome this, but it’s still a real issue.
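The same arithmetic as a quick calculator, using the ~1 MB-per-output-token figure above for a 13B model:

```python
MB_PER_OUTPUT_TOKEN_13B = 1  # approximate per-token generation memory (13B model)

def generation_memory_gb(max_new_tokens: int, batch_size: int) -> float:
    return max_new_tokens * MB_PER_OUTPUT_TOKEN_13B * batch_size / 1024

print(generation_memory_gb(512, 1))   # ~0.5 GB for one 512-token output
print(generation_memory_gb(512, 16))  # ~8 GB for a batch of 16
```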

Cheatsheet​

 