GALPACA 30B (large)​

GALACTICA 30B fine-tuned on the Alpaca dataset.

The model card from the original Galactica repo can be found here, and the original paper here.

The dataset card for Alpaca can be found here, and the project homepage here. The Alpaca dataset was collected with a modified version of the Self-Instruct Framework, and was built using OpenAI's text-davinci-003 model. As such it is subject to OpenAI's terms of service.

Model Details​

The GALACTICA models are trained on a large-scale scientific corpus and are designed to perform scientific tasks. The Alpaca dataset is a set of 52k instruction-response pairs designed to enhance the instruction-following capabilities of pre-trained language models.

Model Use​

The GALACTICA model card specifies that the primary intended users of the GALACTICA models are researchers studying language models applied to the scientific domain, and it cautions against production use of GALACTICA without safeguards due to the potential for the model to produce inaccurate information. The original GALACTICA models are available under a non-commercial CC BY-NC 4.0 license, and the GALPACA model is additionally subject to the OpenAI Terms of Service.
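
For reference, here is a minimal usage sketch with the Transformers library. The Hugging Face repo id, the generation settings, and the Alpaca-style prompt template are assumptions for illustration, not details taken from the model card.

Code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GeorgiaTechResearchInstitute/galpaca-30b"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
# Alpaca-style prompt template (assumed to match the fine-tuning format)
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nSummarize the main idea of the Transformer architecture.\n\n"
    "### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))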


 



Natural Language → SQL​

🌉 Demo on San Francisco City Data: SanFranciscoGPT.com

🇺🇸 Demo on US Census Data: CensusGPT.com



Welcome to textSQL, a project which uses LLMs to democratize access to data analysis. Example use cases of textSQL are San Francisco GPT and CensusGPT — natural language interfaces to public data (SF city data and US census data), enabling anyone to analyze and gain insights from the data.
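
textSQL's actual pipeline lives in the project repo; as a rough illustration of the general pattern only, here is a minimal sketch that asks an LLM to translate a question into SQL against a toy schema and runs the result with sqlite3. The schema, database file, model name, and prompt are all assumptions, not the project's code.

Code:
import sqlite3
from openai import OpenAI  # assumes the OpenAI Python SDK and an OPENAI_API_KEY env var

client = OpenAI()
# Toy schema standing in for the real census / SF city tables
SCHEMA = "CREATE TABLE census (zip_code TEXT, population INTEGER, median_income REAL);"

def question_to_sql(question: str) -> str:
    """Ask the model to translate a natural-language question into a single SQLite query."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": f"Translate questions into SQLite SQL for this schema:\n{SCHEMA}\nReturn only the SQL."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip()

sql = question_to_sql("Which 5 zip codes have the highest median income?")
conn = sqlite3.connect("census.db")  # assumed local database matching the schema above
print(conn.execute(sql).fetchall())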
 



dolly-v2-2-8b Model Card​

Summary​

Databricks’ dolly-v2-2-8b is an instruction-following large language model trained on the Databricks machine learning platform and licensed for commercial use. Based on pythia-2.8b, Dolly is trained on ~15k instruction/response fine-tuning records from databricks-dolly-15k, generated by Databricks employees in capability domains from the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. dolly-v2-2-8b is not a state-of-the-art model, but it does exhibit surprisingly high-quality instruction-following behavior not characteristic of the foundation model on which it is based.

Owner: Databricks, Inc.

Model Overview​

dolly-v2-2-8b is a 2.8 billion parameter causal language model created by Databricks, derived from EleutherAI’s Pythia-2.8b and fine-tuned on a ~15K record instruction corpus generated by Databricks employees and released under a permissive license (CC-BY-SA).


dolly-v2-6-9b Model Card

Summary

Databricks’ dolly-v2-6-9b is an instruction-following large language model trained on the Databricks machine learning platform and licensed for commercial use. Based on pythia-6.9b, Dolly is trained on ~15k instruction/response fine-tuning records from databricks-dolly-15k, generated by Databricks employees in capability domains from the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. dolly-v2-6-9b is not a state-of-the-art model, but it does exhibit surprisingly high-quality instruction-following behavior not characteristic of the foundation model on which it is based.

Owner: Databricks, Inc.

Model Overview​

dolly-v2-6-9b is a 6.9 billion parameter causal language model created by Databricks, derived from EleutherAI’s Pythia-6.9b and fine-tuned on a ~15K record instruction corpus generated by Databricks employees and released under a permissive license (CC-BY-SA).
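
For reference, the Dolly model cards describe usage through the Transformers pipeline API; a minimal sketch along those lines is below. The prompt is made up, and treat the exact arguments as assumptions.

Code:
import torch
from transformers import pipeline

# trust_remote_code is needed because Dolly ships a custom instruction-following pipeline
generate_text = pipeline(
    model="databricks/dolly-v2-6-9b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
print(generate_text("Explain the difference between pre-training and instruction fine-tuning."))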
 


How to fine tune and serve LLMs simply, quickly and cost effectively using Ray + DeepSpeed + HuggingFace​


This is part 4 of our blog series on Generative AI. In the previous blog posts we explained why Ray is a sound platform for Generative AI, showed how it can push the performance limits, and showed how you can use Ray for Stable Diffusion.

In this blog, we share a practical approach to using the combination of HuggingFace, DeepSpeed, and Ray to build a system for fine-tuning and serving LLMs, in 40 minutes and for less than $7 for a 6 billion parameter model. In particular, we illustrate the following:
  • Using these three components, you can simply and quickly put together an open-source LLM fine-tuning and serving system.
  • By taking advantage of Ray’s distributed capabilities, we show how this can be both more cost-effective and faster than using a single large (and often unobtainable) machine.

Here’s what we’ll be doing:​

  • Discussing why you might want to run your own LLM instead of using one of the new API providers.
  • Showing you the evolving tech stack we are seeing for cost-effective LLM fine-tuning and serving, combining HuggingFace, DeepSpeed, Pytorch, and Ray.
  • Showing you 40 lines of Python code that can enable you to serve a 6 billion parameter GPT-J model.
  • Showing you how, for less than $7, you can fine-tune the model to sound more medieval using the works of Shakespeare, by running the job in a distributed fashion on low-cost machines, which is considerably more cost-effective than using a single large, powerful machine.
  • Showing how you can serve the compiled model binary of the fine-tuned 6B LLM.
  • Showing how the fine-tuned model compares to a prompt engineering approach with large systems.

Why would I want to run my own LLM?​

There are many, many providers of LLM APIs online. Why would you want to run your own? There are a few reasons:
  • Cost, especially for fine-tuned inference: For example, OpenAI charges 12c per 1000 tokens (about 700 words) for a fine-tuned model on Davinci. It’s important to remember that many user interactions require multiple backend calls (e.g. one to help with the prompt generation, post-generation moderation, etc.), so it’s very possible that a single interaction with an end user could cost a few dollars (see the rough arithmetic after this list). For many applications, this is cost prohibitive.
  • Latency: using these LLM APIs can be quite slow. A GPT-3.5 query, for example, can take up to 30 seconds. Combine a few round trips between your data center and theirs, and it is possible for a query to take minutes. Again, this makes many applications impossible. Bringing the processing in-house allows you to optimize the stack for your application, e.g. by using low-resolution models, tightly packing queries to GPUs, and so on. We have heard from users that optimizing their workflow has often resulted in a 5x or more latency improvement.
  • Data Security & Privacy: For many applications, you have to send these APIs a lot of data to get a useful response (e.g. sending a few snippets of internal documents and asking the system to summarize them). Many of the API providers reserve the right to use those inputs for retraining. Given the sensitivity of organizational data, and frequent legal constraints like data residency, this is especially limiting. One particularly concerning recent development is the ability to regenerate training data from trained models, with people unintentionally disclosing secret information as a result.
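
To make the cost point concrete, here is a back-of-the-envelope check of the "few dollars per interaction" claim; the call count and token counts are illustrative assumptions, not measurements.

Code:
# Rough cost estimate for one end-user interaction against a fine-tuned Davinci model.
price_per_1k_tokens = 0.12   # $ per 1,000 tokens (prompt + completion), as quoted above
calls_per_interaction = 6    # assumed: prompt helper, main generation, moderation, retries, ...
tokens_per_call = 3_000      # assumed: long prompts plus completions

cost = calls_per_interaction * tokens_per_call / 1_000 * price_per_1k_tokens
print(f"~${cost:.2f} per end-user interaction")  # ~$2.16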

OK, so how do I run my own?​

The LLM space is evolving incredibly rapidly. What we are seeing is a particular technology stack emerging that combines multiple technologies:
[Figure: the LLM fine-tuning and serving tech stack]

What we’ve also seen is a reluctance to go beyond a single machine for training, in part because moving to multiple machines is perceived as complicated. The good news is this is where Ray.io shines (ba-dum-tish). It simplifies cross-machine coordination and orchestration using not much more than Python and Ray decorators, and it is also a great framework for composing this entire stack together.

Recent results on Dolly and Vicuna (both trained with Ray, or built on models trained with Ray, like GPT-J) show that relatively small LLMs (say, the open-source GPT-J-6B with 6 billion parameters) can be incredibly powerful when fine-tuned on the right data. The key is the fine-tuning and the right data. So you do not always need the latest and greatest model with 150 billion-plus parameters to get useful results. Let’s get started!

Serving a pre-existing model with Ray for text generation​

The detailed steps on how to serve the GPT-J model with Ray can be found here, so let’s highlight some of the aspects of how we do that.

Code:
@serve.deployment(ray_actor_options={"num_gpus": 1})
class PredictDeployment:
    def __init__(self, model_id: str, revision: str = None):
        from transformers import AutoModelForCausalLM, AutoTokenizer
        import torch
        self.model = AutoModelForCausalLM.from_pretrained(
            "EleutherAI/gpt-j-6B",
            revision=revision,
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            device_map="auto",  # automatically makes use of all GPUs available to the Actor
        )


Serving in Ray happens in actors, in this case one called PredictDeployment. This code shows the __init__ method of the actor, which downloads the model from Hugging Face. To launch the model on the current node, we simply do:

Code:
deployment = PredictDeployment.bind(model_id=model_id, revision=revision)
serve.run(deployment)


That starts a service on port 8000 of the local machine.
We can now query that service using a few lines of Python

Code:
import requests
prompt = (
    "Once upon a time, there was a horse. "
)
sample_input = {"text": prompt}
output = requests.post("http://localhost:8000/", json=[sample_input]).json()
print(output)

And sure enough, this prints out a continuation of the above opening. Each time it runs, the output is slightly different.

Once upon a time, there was a horse.
But this particular horse was too big to be put into a normal stall. Instead, the animal was moved into an indoor pasture, where it could take a few hours at a time out of the stall. The problem was that this pasture was so roomy that the horse would often get a little bored being stuck inside. The pasture also didn’t have a roof, and so it was a good place for snow to accumulate.

This is certainly an interesting direction and story … but now we want to set it in the medieval era. What can we do?

Fine Tuning​

Now that we’ve shown how to serve a model, how do we fine-tune it to be more medieval? What if we train it on 2,500 lines from Shakespeare?
This is where DeepSpeed comes in. DeepSpeed is a set of optimized algorithms for training and fine-tuning networks. The problem is that DeepSpeed doesn’t have an orchestration layer. This is not much of a problem on a single machine, but if you want to use multiple machines, it typically involves a bunch of bespoke ssh commands, complex key management, and so on.

This is where Ray can help.

This page in the Ray documentation discusses how to fine-tune it to sound more like something from the 15th century with a bit of flair.
Let’s go through the key parts. First, we load the data from Hugging Face:

Code:
from datasets import load_dataset
print("Loading tiny_shakespeare dataset")
current_dataset = load_dataset("tiny_shakespeare")


Skipping the tokenization code, here’s the heart of the code that we will run for each worker.

Code:
def trainer_init_per_worker(train_dataset, eval_dataset=None, **config):
    # Use the actual number of CPUs assigned by Ray
    model = GPTJForCausalLM.from_pretrained(model_name, use_cache=False)
    model.resize_token_embeddings(len(tokenizer))
    enable_progress_bar()
    metric = evaluate.load("accuracy")
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
    )
    return trainer
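
The snippet above references training_args, tokenizer, and compute_metrics, which are defined elsewhere in the full example. As a rough sketch only, the TrainingArguments with DeepSpeed ZeRO stage 3 enabled might look something like the following; the exact values and config are assumptions, not the settings used in the blog's benchmark.

Code:
from transformers import TrainingArguments

# Assumed DeepSpeed ZeRO-3 config; "auto" lets the HF Trainer fill in matching values
deepspeed_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="gpt-j-6b-shakespeare",
    per_device_train_batch_size=16,
    num_train_epochs=1,
    fp16=True,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    deepspeed=deepspeed_config,  # hands the config to the HF Trainer's DeepSpeed integration
)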

And now we create a Ray AIR HuggingFaceTrainer that orchestrates the distributed run and wraps around multiple copies of the training loop above:

Code:
trainer = HuggingFaceTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    trainer_init_config={
        "batch_size": 16,  # per device
        "epochs": 1,
    },
    scaling_config=ScalingConfig(
        num_workers=num_workers,
        use_gpu=use_gpu,
        resources_per_worker={"GPU": 1, "CPU": cpus_per_worker},
    ),
    datasets={"train": ray_datasets["train"], "evaluation": ray_datasets["validation"]},
    preprocessor=Chain(splitter, tokenizer),
)
results = trainer.fit()
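
Once fit() returns, the Result object it produces carries the final metrics and a checkpoint holding the fine-tuned weights. A small sketch of how that might be used is below; the upload step and bucket name mirror the S3 path used in the serving section further down and are an assumption, not code from the blog.

Code:
# results is the ray.air Result returned by trainer.fit()
print(results.metrics)            # final training / evaluation metrics
checkpoint = results.checkpoint   # checkpoint containing the fine-tuned weights

# Assumed: push the checkpoint to S3 so the serving deployment can load it later
checkpoint.to_uri("s3://demo-pretrained-model/gpt-j-6b-shakespeare")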
 
While there is some complexity here, it is not much more complex than the code needed to run on a single machine.

Fine-tuning and Performance​

One of the most important topics related to LLMs is the question of cost. In this particular case, the costs are small (in part because we ran only one epoch of fine-tuning, whereas depending on the problem 1-10 epochs are typical, and in part because this dataset is not very large). But running the tests on different configurations shows us that understanding performance is not always easy. The table below shows some benchmarking results with different machine configurations on AWS.
Configuration                        | # instances | Time (mins) | Total cost (on-demand) | Total cost (spot) | Cost ratio
16 x g4dn.4xlarge (1 x T4 16GB GPU)  | 16          | 48          | $15.41                 | $6.17             | 100%
32 x g4dn.4xlarge (1 x T4 16GB GPU)  | 32          | 30          | $19.26                 | $7.71             | 125%
1 x p3.16xlarge (8 x V100 16GB GPU)  | 1           | 44          | $17.95                 | $9.27             | 150%
1 x g5.48xlarge (8 x A10G 24GB GPU)  | 1           | 84          | $22.81                 | $10.98            | 178%
[Figure: fine-tuning time and cost by machine configuration]

Note: we tried to run the same test with A100s, but we were unable to obtain the p4d machines to do so.
Looking at these numbers, we see some surprises:
  • Perhaps the most obvious machine to use, the g5.48xlarge (the machine with the highest on-paper performance), is both the most expensive and the slowest, at almost twice the price when using spot instances.
  • The p3.16xlarge with its use of NVLink between the GPUs is a considerably better option.
  • Most surprising of all, using multiple machines is both the cheapest and the fastest option.
The exact same code is running on all the machines, and aside from tweaking the number of GPU workers, nothing else was changed. Using multiple machines gave us the cheapest (16 machines) and the fastest (32 machines) option of the ones we benchmarked.

This is the beauty and power of Ray. The code itself was simple enough and, in fact, was able to use a standard library (DeepSpeed) with no modifications, so it was no more complex than in the single-machine case. At the same time, it gave us more options and the flexibility to optimize to be both cheaper and faster than a single machine.

Closing the loop: Serving the fine-tuned model​

Now that we have a fine-tuned model, let’s try to serve it. The only changes we need to make are to (a) copy the model to S3 from the fine-tuning process and (b) load it from there. In other words, the only change from the code we started with is:

Code:
        checkpoint = Checkpoint.from_uri(
             "s3://demo-pretrained-model/gpt-j-6b-shakespeare"
        )
        with checkpoint.as_directory() as dir:
            self.model = AutoModelForCausalLM.from_pretrained(
                dir,
                torch_dtype=torch.float16,
                low_cpu_mem_usage=True,
                device_map="auto")
            self.tokenizer = AutoTokenizer.from_pretrained(dir)


And now let’s try querying it again:
Once upon a time there was a horse. This horse was in my youth, a little unruly, but yet the best of all. I have, sir; I know every horse in the field, and the best that I have known is the dead. And now I thank the gods, and take my leave.
As you can see, it definitely has more of a Shakespearean flavor.

Conclusion​

We have shown a new tech stack that combines Ray, HuggingFace, DeepSpeed, and PyTorch and demonstrated that:
  • It is simple and quick to deploy a model as a service.
  • Fine-tuning can be done cost-effectively, and is actually most cost-effective when using multiple machines, without added complexity.
  • Even a single epoch of fine-tuning can change the output of a trained model.
  • Deploying a fine-tuned model is only marginally harder than deploying a standard one.
 




About​

Whisper & GPT-based app for passing remote SWE interviews

Cheetah is an AI-powered macOS app designed to assist users during remote software engineering interviews by providing real-time, discreet coaching and live coding platform integration.



With Cheetah, you can improve your interview performance and increase your chances of landing that $300k SWE job, without spending your weekends cramming leetcode challenges and memorizing algorithms you'll never use.
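
Cheetah itself is a native macOS app, but as a very rough sketch of the underlying idea (not the app's actual code), transcribing a chunk of interview audio with Whisper and asking a GPT model for a concise answer could look like the following in Python. The file name, model names, and prompts are assumptions.

Code:
from openai import OpenAI  # assumes the OpenAI Python SDK and an OPENAI_API_KEY env var

client = OpenAI()

# 1. Transcribe a short chunk of captured interview audio (assumed local file)
with open("interview_chunk.wav", "rb") as audio:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio)

# 2. Ask a chat model for a brief, discreet answer to whatever was just asked
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a coding interview assistant. Answer briefly."},
        {"role": "user", "content": transcript.text},
    ],
)
print(reply.choices[0].message.content)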
 


vicuna-13b-free

Vicuna 1.0 13B trained on the unfiltered dataset V3 "no imsorry" (sha256 014bcc3352fd62df5bbb7fb8af9b4fd12f87bb8a2b48a147789f245176ac8e4f)

Note: unfiltered Vicuna is a work in progress. Censorship and/or other issues might be present in the output of intermediate model releases.



Vicuna 7B without "ethics" filtering

This repository contains an alternative version of the Vicuna 7B model.

This model was natively fine-tuned using ShareGPT data, but without the "ethics" filtering used for the original Vicuna.

A GPTQ quantised 4-bit version is available here.

Original Vicuna Model Card​

Model details​

Model type: Vicuna is an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. It is an auto-regressive language model, based on the transformer architecture.

Model date: Vicuna was trained between March 2023 and April 2023.

Organizations developing the model: The Vicuna team with members from UC Berkeley, CMU, Stanford, and UC San Diego.
 