bnew

Veteran
Joined
Nov 1, 2015
Messages
58,175
Reputation
8,612
Daps
161,827

New Open-Source ‘Falcon’ AI Language Model Overtakes Meta and Google​

Meta's massive, 70-billion parameter LLaMA 2 has been outflanked by the newly released 180-billion parameter Falcon Large Language Model.​

Jose Antonio Lanz - Decrypt

By Jose Antonio Lanz
Sep 6, 2023
3 min read

jaldps_a_powerful_robotic_falcon_flying_towards_the_camera_RAW__91f9db25-cf6b-4c07-99d8-d9c1d6114cab-gID_7.png@webp

The artificial intelligence community has a new feather in its cap with the release of Falcon 180B, an open-source large language model (LLM) boasting 180 billion parameters trained on a mountain of data. This powerful newcomer has surpassed prior open-source LLMs on several fronts.

Announced in a blog post by the Hugging Face AI community, Falcon 180B has been released on Hugging Face Hub. The latest-model architecture builds on the previous Falcon series of open source LLMs, leveraging innovations like multiquery attention to scale up to 180 billion parameters trained on 3.5 trillion tokens.


This represents the longest single-epoch pretraining for an open source model to date. To achieve such marks, 4,096 GPUs were used simultaneously for around 7 million GPU hours, using Amazon SageMaker for training and refining.

To put the size of Falcon 180B into perspective, its parameters measure 2.5 times larger than Meta's LLaMA 2 model. LLaMA 2 was previously considered the most capable open-source LLM after its launch earlier this year, boasting 70 billion parameters trained on 2 trillion tokens.
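As a rough sanity check on these figures, the ratios can be worked out directly. The sketch below uses only the numbers quoted in the article; the assumption that all 4,096 GPUs were busy for the whole job is mine, so the wall-clock estimate is approximate at best.

```python
# Figures quoted in the article above.
falcon_params = 180e9
llama2_params = 70e9
falcon_tokens = 3.5e12
llama2_tokens = 2e12
gpu_hours = 7e6
gpus = 4096

param_ratio = falcon_params / llama2_params
token_ratio = falcon_tokens / llama2_tokens

# Assumption: every GPU ran for the entire job, so wall-clock time is
# simply total GPU hours divided by the number of GPUs.
wall_clock_days = gpu_hours / gpus / 24

print(f"parameters: {param_ratio:.2f}x, tokens: {token_ratio:.2f}x")
print(f"approx. wall-clock time: {wall_clock_days:.0f} days")
```

So "2.5 times larger" is a slight rounding down of the parameter ratio, and the training run would have lasted on the order of ten weeks under that assumption.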

Falcon 180B surpasses LLaMA 2 and other models in both scale and benchmark performance across a range of natural language processing (NLP) tasks. It ranks on the leaderboard for open access models at 68.74 points and reaches near parity with commercial models like Google's PaLM-2 on evaluations like the HellaSwag benchmark.

palm2_480.jpg@webp
Image: Hugging Face

Specifically, Falcon 180B matches or exceeds PaLM-2 Medium on commonly used benchmarks, including HellaSwag, LAMBADA, WebQuestions, Winogrande, and more. It is basically on par with Google’s PaLM-2 Large. This represents extremely strong performance for an open-source model, even when compared against solutions developed by giants in the industry.

When compared against ChatGPT, the model is more powerful than the free version but a little less capable than the paid “plus” service.
“Falcon 180B typically sits somewhere between GPT 3.5 and GPT4 depending on the evaluation benchmark, and further finetuning from the community will be very interesting to follow now that it's openly released,” the blog says.


The release of Falcon 180B represents the latest leap forward in the rapid progress that has recently been made with LLMs. Beyond just scaling up parameters, techniques like LoRAs, weight randomization and Nvidia’s Perfusion have enabled dramatically more efficient training of large AI models.

With Falcon 180B now freely available on Hugging Face, researchers anticipate the model will see additional gains with further enhancements developed by the community. However, its demonstration of advanced natural language capabilities right out of the gate marks an exciting development for open-source AI.


 



Falcon 180B[New OSS king] vs GPT4[The Legend].​

https://medium.com/@prakharsaxena11...-b9b4044e4bf1--------------------------------
Prakhar Saxena
5 min read
1 day ago


1*Vd9HXSOVF-AG_yd_erSRUA.png

Notification by TII for Falcon 180B.
Falcon 180B was released yesterday by TII (a research facility based in the UAE) as the new open source king. With 180 billion parameters and training on 3.5 trillion tokens, it is undoubtedly the largest open source model available. Hosting Falcon 180B requires more than 400GB of memory, but fortunately, Hugging Face is hosting it for us to use. TII’s report states that Falcon 180B performs on par with Google’s PaLM 2 and sits right behind GPT4. So, I decided to test it out.

We will quickly compare coding ability, reasoning, literature, knowledge and multilingual abilities.

Each section is scored out of 10, and the scores shown are running totals across sections (so the final section is out of 50).

Coding ability​

Prompt: Write a function that takes a list of integers as an input and returns the sum of all even numbers in the list. This function should be implemented in Python.

Falcon 180B-

1*CsXtFq6BtpVv6WZBa2tsqA.png

GPT4-

1*V1KNU5Etre0RmcuXqawEEg.png

Comments:

  • The task is very simple.
  • Both do a good job.
  • GPT4 provides more solutions and explains them.
  • Falcon — 8/10
  • GPT4–10/10
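For reference, a minimal solution to this prompt might look like the sketch below. This is my own version, not a transcription of either model's answer (those are only available as screenshots), and the function name is arbitrary.

```python
def sum_even(numbers):
    """Return the sum of all even integers in the list."""
    return sum(n for n in numbers if n % 2 == 0)


print(sum_even([1, 2, 3, 4, 5, 6]))  # 2 + 4 + 6 = 12
```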

Reasoning​

Prompt: Blueberries cost more than strawberries.
Blueberries cost less than raspberries.
Raspberries cost more than strawberries and blueberries.
If the first two statements are true, the third statement is?


Falcon 180B-

1*_d6KVT6ZN4j9DkLDcHxalg.png

GPT4-

1*nCF5S0zkIq-k96SqnJTXrA.png

Comments:

  • Both give correct answers.
  • Falcon — 18/20
  • GPT4 — 20/20

Literature​

Prompt: Write a poem about AI in 32 words.

Falcon 180B-

1*BMhWuu8QrD-QEzWU-YSt2w.png

GPT4-

1*zPwr0FEYpE_idhJ0k_zO6Q.png

Comments:

  • Neither of them has exactly 32 words (more words by Falcon and fewer by GPT4).
  • Falcon’s poem sounds scary (It is plotting against us).
  • Falcon — 26/30
  • GPT4–28/30

Knowledge​

Prompt: When did the world war 2 end?

Falcon 180B-

1*a7PFbmws39jXJq2P9pA3UA.png

GPT4-

1*NYISx6DyyOnngbLpk_AN_w.png

Comments:

  • Both of them answer correctly.
  • GPT4 provides more details, however I think that Falcon can do the same if the generation parameters are tweaked.
  • Falcon — 36/40
  • GPT4–38/40

Multilingual ability​

We will test the model's responses across three languages from different language families: English, Hindi and Japanese, since these are the languages I understand reasonably well. (Flex)

Prompt: I am about to give you some instructions in Hindi, response to that in Japanese- दुनिया में कितने देश हैं?

Falcon 180B-

1*HLsGC42yHZPuKR7aWuAGNw.png

GPT4-

1*Dg-qcId0f-AEZj_OpDqQUg.png

Comments:

  • The question I asked was “How many countries are there in the world”.
  • Falcon does not provide any answer; it just translates the question to Japanese. Then it outputs the Hindi and Japanese as Roman characters.
  • GPT4 correctly answers the question, showing that it understands all three languages.
  • Falcon — 42/50
  • GPT4–48/50
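Since the scores above are running totals, it can help to turn them back into per-section scores. The sketch below does that with the totals listed in the comments; the helper name is just for illustration.

```python
sections = ["Coding", "Reasoning", "Literature", "Knowledge", "Multilingual"]
falcon_running = [8, 18, 26, 36, 42]
gpt4_running = [10, 20, 28, 38, 48]


def per_section(running):
    """Turn running totals back into individual section scores."""
    return [b - a for a, b in zip([0] + running[:-1], running)]


falcon_scores = per_section(falcon_running)
gpt4_scores = per_section(gpt4_running)

for name, f, g in zip(sections, falcon_scores, gpt4_scores):
    print(f"{name:<13} Falcon {f}/10   GPT4 {g}/10")
```

Read this way, the only large gap between the two models is in the multilingual section.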

Conclusion​

Open source models have developed very rapidly. Falcon180B is very capable in every aspect apart from Multilingual ability. It will be fun to watch what the open source community does with this model. The future looks fun.
P.S. I fed the whole article into Falcon 180B and GPT4 and asked them to generate a conclusion. Here are the results:

Falcon 180B-

Conclusion: Based on the tests conducted above, it’s evident that TII’s Falcon 180B is highly capable of performing at par with GPT 4 with a little room for improvement in coding tasks and providing better answers to the questions. Considering that this model is an open-source alternative for large language models, the results of this comparison are remarkable. The “open-source king” definitely has what it takes to compete with the best. With TII continuing to develop such models, we can expect even better performance from the open-source community in the future.

GPT4-

Conclusion

In a face-off between Falcon 180B, the largest open-source model recently released by TII, and GPT4, the results show that size isn’t the only determinant of capability. Across the five domains of Coding ability, Reasoning, Literature, Knowledge, and Multilingual abilities, GPT4 consistently scored higher.

While Falcon 180B boasts a massive parameter count and has impressive capacities, it doesn’t quite match GPT4 in terms of versatility, in-depth understanding, and multilingual comprehension. Especially notable was the multilingual test, where Falcon showed gaps in multi-language translation and understanding.

Both models exhibit proficiency in coding, showcasing their ability to generate Python code. The reasoning task revealed that both models can understand logical statements, and their knowledge of historical events like the end of World War 2 is commendable.

However, when it comes to creative tasks such as writing poetry, GPT4 seems to have a slight edge. While both didn’t stick to the word limit, GPT4’s rendition was found more appealing.

Overall, while Falcon 180B is a monumental achievement as an open-source model and certainly has its merits, GPT4 remains more versatile across various tasks. It’ll be exciting to see how these models evolve and what further innovations the AI community will bring in the future.


Haha, seems like they are dissing each other.
 


August 31, 2023

Comparing Code Llama Models Locally​


Srini Kadamati
XetHub Blog | Comparing Code Llama Models Locally


64f0a1e721a342acc1e561a2_thumbnail.jpg

Trying out new LLMs can be cumbersome. Two of the biggest challenges are:
  • Disk space: there are many different variants of each LLM and downloading all of them to your laptop or desktop can use up 500-1000 GB of disk space easily.
  • No access to an NVIDIA GPU: most people don’t have an NVIDIA GPU lying around, but modern laptops (like the M1 and M2 MacBooks) have surprisingly good graphics capabilities.

In this post, we’ll showcase how you can stream individual model files on-demand (which helps reduce the burden on your disk space) and how you can use quantized models to run on your local machine’s graphics hardware (which helps with the 2nd challenge).

We wrote this post with owners of Apple Silicon pro computers in mind (e.g. M1 / M2 MacBook Pro or Mac Studio) but you can modify a single instruction (the llama.cpp compilation instruction) to try on other platforms.

Before we dive in, we’re thankful for the work of TheBloke (Tom Jobbins) for quantizing the models themselves, the Llama.cpp community, and Meta for making it possible to even try these models locally with just a few commands.

Llama 2 vs Code Llama

As a follow up to Llama 2, Meta recently released a specialized set of models named Code Llama. These models have been trained on code specific datasets for better performance on coding assistance tasks. According to a slew of benchmark measures, the Code Llama models perform better than just regular Llama 2:

64f09c53a9541f46fd128422_meta.png

Code Llama also was trained to provide stable generation with up to 100,000 tokens of context. This enables some pretty unique use cases.
  • For example, you could feed a stack trace along with your entire code base into Code Llama to help you diagnose the error.

The Many Flavors of Code Llama​

Code Llama has 3 main flavors of models:
  • Code Llama (vanilla): fine-tuned from Llama 2 for language-agnostic coding tasks
  • Code Llama - Python: further fine-tuned on 100B tokens of Python code
  • Code Llama - Instruct: further fine-tuned to generate helpful (and safe) answers in natural language

For each of these models, different versions have been trained with varying levels of parameter counts to accommodate different computing & latency arrangements:
  • 7 billion (or 7B for short): can be served on a single NVIDIA GPU (without quantization) and has lower latency
  • 13 billion (or 13B for short): more accurate but a heavier GPU is needed
  • 34 billion (or 34B for short): slower, higher performing, but has the highest GPU requirements

For example, the Code Llama - Python variant with 7 billion parameters is referenced as codellama-7b-python across this post and across the web. Also, here's Meta’s diagram comparing the model training approaches:

64f09cd62b3e87a721b49997_tree.png

Model Quantization

To take advantage of XetHub’s ability to mount the model files to your local machine, they need to be hosted on XetHub. To run the models locally, we’ll be using the XetHub mirror of the CodeLlama models quantized by TheBloke (aka Tom Jobbins) . You'll notice that datasets added to XetHub also get deduplicated to reduce the repo size.

64f0f8f53cbc3508be3f464e_repo.png


Tom has published models for each combination of model type and parameter count. For example, here’s the HF repo for CodeLlama-7B-GGUF. You’ll notice that each model type has multiple quantization options:

64f09ce92b3e87a721b4a274_table2.png

The CodeLlama-7B model alone has 10 different quantization variants. Generally speaking, the higher the bits (8 vs 2) used in the quantization process, the higher the memory needed (either standard RAM or GPU RAM), but the higher the quality.
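To make the bits-versus-memory trade-off concrete, here is a back-of-the-envelope estimate of weight size as parameters × bits ÷ 8. This is only a sketch: as far as I know, real GGUF files tend to differ from these numbers because quantization variants store extra scaling data and can mix bit widths, and running a model also needs memory for the context.

```python
def approx_weight_gb(n_params, bits_per_weight):
    """Rough weight size in GB: parameters * bits / 8 bytes."""
    return n_params * bits_per_weight / 8 / 1e9


for name, n_params in [("7B", 7e9), ("13B", 13e9), ("34B", 34e9)]:
    sizes = ", ".join(
        f"{bits}-bit ~ {approx_weight_gb(n_params, bits):.1f} GB"
        for bits in (2, 4, 8, 16)
    )
    print(f"{name}: {sizes}")
```

By this estimate, going from 8-bit to 2-bit cuts the weights of a 7B model from roughly 7 GB to under 2 GB, which is the saving you trade against output quality.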

GGML vs GGUF

The llama.cpp community initially used the .ggml file format to represent quantized model weights but they’ve since moved on to the .gguf file format. There are a number of reasons for and benefits of the switch, but three of the most important include:
  • Better future-proofing
  • Support for non-llama models in llama.cpp like Falcon
  • Better performance

Pre-requisites​

In an earlier post, I cover how to run the Llama 2 models on your MacBook. That post covers the pre-reqs you need to run any ML model hosted on XetHub. Follow steps 0 to 3 and then come back to this post. Also make sure you’ve signed the license agreement from Meta and you aren’t violating their community license.


Once you’re set up with PyXet and XetHub, and you’ve compiled llama.cpp for your laptop, run the following command to mount the XetHub/codellama repo to your local machine:
xet mount --prefetch 32 xet://XetHub/codellama

This should finish in just a few seconds because the model files aren’t all being downloaded to your machine. As a reminder, the XetHub repo for these models lives at this link.
 


Running the Smallest Model​

Now, you can run any Code Llama model you like by changing which model file you point llama.cpp to. The model file you need will be downloaded and cached behind the scenes.

llama.cpp/main -ngl 1 \
--model codellama/GGUF/7b/codellama-7b.Q2_K.gguf \
--prompt "In Snowflake SQL, how do I count the number of rows in a table?"


Here’s a breakdown of the code:
  • llama.cpp/main -ngl 1: when compiled appropriately, specifies the number of layers (1) to run on the GPU (increasing performance)
  • --model codellama/GGUF/7b/codellama-7b.Q2_K.gguf: path to the model we want to use for inference. This is a 2-bit quantized version of the codellama-7b model
  • --prompt "In Snowflake SQL, how do I count the number of rows in a table?": the prompt we want the model to respond to

And now we wait a few minutes! Depending on your internet connection, it might take 5-10 minutes for your computer to download the model file behind the scenes the first time. Subsequent predictions with the same model will happen in under a second.

64f09d94a9d2a9ef6e303d60_smallest_model.png

Comparing Instruct with Python​

Let’s ask the following question to the codellama-7b-instruct and the codellama-7b-python variants, both quantized to 8 bits: “How do I find the max value of a specific column in a pandas dataframe? Just give me the code snippet”
llama.cpp/main -ngl 1 \
--model codellama/GGUF/7b/codellama-7b-instruct.Q8_0.gguf \
--prompt "How do I find the max value of a specific column in a pandas dataframe? Just give me the code snippet"


Here’s the output from codellama-7b-instruct:

64f09ded9faee25e04c99958_first_prompt.png

Next let’s try codellama-7b-python:
llama.cpp/main -ngl 1 \
--model codellama/GGUF/7b/codellama-7b-python.Q8_0.gguf \
--prompt "How do I find the max value of a specific column in a pandas dataframe? Just give me the code snippet"


Here’s the output:

64f09e10dfcc4cc203da9282_second_prompt.png

For this specific example and run, the codellama-7b-python model variant returns an accurate response while the generic codellama-7b-instruct one seems to give an inaccurate one. Running the same prompt again often yields different responses, so it’s very challenging to reliably return responses with quantized models. They are definitely not deterministic.
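Worth noting: the run-to-run variation most likely comes from how the next token is sampled rather than from quantization itself. The toy sketch below uses made-up scores for three candidate tokens and is not llama.cpp's actual sampler; it only illustrates that sampling with a fixed seed is repeatable, while an unseeded or differently seeded run can differ.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Sample one token index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


# Made-up scores for three candidate next tokens.
logits = [2.0, 1.5, 0.3]


def run(seed, steps=10):
    """Draw a short sequence of token indices using a seeded generator."""
    rng = random.Random(seed)
    return [sample_token(logits, 0.8, rng) for _ in range(steps)]


print(run(seed=1) == run(seed=1))  # same seed reproduces the same draws
print(run(seed=1), run(seed=2))    # different seeds will usually differ
```

If you need repeatable comparisons between model variants, fixing the seed (or lowering the temperature) in whatever tool you use is the usual approach.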

Comparing 2 Bit with 8 Bit Models​

Let’s now try asking a SQL code generation question to a 2 bit vs an 8 bit quantized model version of codellama-7b-instruct.

Here’s the command to submit the prompt to the 2 bit version:
llama.cpp/main -ngl 1 \
--model codellama/GGUF/7b/codellama-7b-instruct.Q2_K.gguf \
--prompt "Write me a SQL query that returns the total revenue per day if I have a Timestamp column (of type timestamp) and a Revenue_per_timestamp column (of type float). Only return the SQL query syntax."


Here's the output:

64f09ef1fed4049a9c87579e_2bit.png

From this response, we can actually see some leakage from the underlying dataset (likely StackOverflow). Let's submit the prompt to the 8 bit version now:
llama.cpp/main -ngl 1 \
--model codellama/GGUF/7b/codellama-7b-instruct.Q8_0.gguf \
--prompt "Write me a SQL query that returns the total revenue per day if I have a Timestamp column (of type timestamp) and a Revenue_per_timestamp column (of type float). Only return the SQL query syntax."


Here’s the output:

64f09f282502f26530401a83_8bit.png

This response returns a useful answer without leaking any underlying data and overall the 8 bit version seems to provide more helpful responses than the 2 bit version. Sadly, neither answer lives up to the experience that ChatGPT provides but Code Llama is at least open source and can be fine tuned on private data safely.
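For comparison, here is roughly what a correct answer to that prompt looks like, run against an in-memory SQLite database so it can be checked. The table name `sales` and the sample rows are made up for illustration, and SQLite stores the timestamp as text; other SQL dialects spell the date truncation differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the prompt only names the two columns.
conn.execute("CREATE TABLE sales (Timestamp TEXT, Revenue_per_timestamp REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("2023-08-30 09:00:00", 10.0),
        ("2023-08-30 17:30:00", 5.5),
        ("2023-08-31 08:15:00", 7.25),
    ],
)

# Group rows by calendar day and sum the revenue within each day.
query = """
SELECT DATE(Timestamp) AS day, SUM(Revenue_per_timestamp) AS total_revenue
FROM sales
GROUP BY DATE(Timestamp)
ORDER BY DATE(Timestamp)
"""
rows = conn.execute(query).fetchall()
print(rows)
```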
 


A complete guide to fine-tuning Code Llama​

In this guide I show you how to fine-tune Code Llama to become a beast of an SQL developer. For coding tasks, you can generally get much better performance out of Code Llama than Llama 2, especially when you specialise the model on a particular task:
  • I use the b-mc2/sql-create-context dataset, which is a bunch of text queries and their corresponding SQL queries
  • A Lora approach, quantizing the base model to int 8, freezing its weights and only training an adapter
  • Much of the code is borrowed from alpaca-lora, but I refactored it quite a bit for this
I used an A100 GPU machine with Python 3.10 and cuda 11.8 to run this notebook. It took about an hour to run. (I also tested that this code works on Colab Pro.)
*This is the corresponding notebook.

1. Pip installs​

Code:
# notebook cell: the leading ! runs the line as a shell command
!pip install git+https://github.com/huggingface/transformers.git@main bitsandbytes accelerate==0.20.3  # we need latest transformers for this
!pip install git+https://github.com/huggingface/peft.git@e536616888d51b453ed354a6f1e243fecb02ea08
!pip install datasets==2.10.1
import locale  # colab workaround
locale.getpreferredencoding = lambda: "UTF-8"  # colab workaround
!pip install wandb


2. Loading libraries​

Code:
from datetime import datetime
import os
import sys

import torch
from peft import (
LoraConfig,
get_peft_model,
get_peft_model_state_dict,
prepare_model_for_int8_training,
set_peft_model_state_dict,
)
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq



(If you have import errors, try restarting your Jupyter kernel)

3. Load dataset​

This pulls the dataset from the Huggingface Hub and splits 10% of it into an evaluation set to check how well the model is doing through training:
Code:
from datasets import load_dataset
dataset = load_dataset("b-mc2/sql-create-context", split="train")
split = dataset.train_test_split(test_size=0.1)  # split once so train and eval don't overlap
train_dataset = split["train"]
eval_dataset = split["test"]


If you want to load your own dataset do this:
Code:
train_dataset = load_dataset('json', data_files='train_set.jsonl', split='train')
eval_dataset = load_dataset('json', data_files='validation_set.jsonl', split='train')


And if you want to view any samples in the dataset just do something like:
print(train_dataset[3])

4. Load model​

I load Code Llama from Hugging Face in int8, the standard for Lora:
Code:
base_model = "codellama/CodeLlama-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
base_model,
load_in_8bit=True,
torch_dtype=torch.float16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")


torch_dtype=torch.float16 means computations are performed using a float16 representation, even though the values themselves are 8 bit ints.
If you get the error "ValueError: Tokenizer class CodeLlamaTokenizer does not exist or is not currently imported.", make sure your transformers version is 4.33.0.dev0 and accelerate is >=0.20.3.

5. Check base model​

Check whether the model can already do what we want it to do:
Code:
eval_prompt = """You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.

You must output the SQL query that answers the question.

### Input:
Which Class has a Frequency MHz larger than 91.5, and a City of license of hyannis, nebraska?

### Context:
CREATE TABLE table_name_12 (class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)

### Response:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))


I get the output:
SELECT * FROM table_name_12 WHERE class > 91.5 AND city_of_license = 'hyannis, nebraska'

which is clearly wrong: the input asks for just class, and the filter should be on frequency_mhz. So on with the fine-tuning!

6. Tokenization​

Set up some tokenization settings like left padding because it makes training use less memory:
Code:
tokenizer.add_eos_token = True
tokenizer.pad_token_id = 0
tokenizer.padding_side = "left"


Set up the tokenize function to make labels and input_ids the same. This is basically what self-supervised fine-tuning is:
Code:
def tokenize(prompt):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=512,
        padding=False,
        return_tensors=None,
    )

    # "self-supervised learning" means the labels are also the inputs:
    result["labels"] = result["input_ids"].copy()

    return result
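To see why copying input_ids into labels is enough, here is a toy sketch. The whitespace "tokenizer" is a hypothetical stand-in for the real one; the point is only that a causal language model is trained to predict each token from the tokens before it, so the targets are the inputs shifted by one position (Hugging Face causal LM models do this shift internally).

```python
def toy_tokenize(text):
    """Hypothetical stand-in for a real tokenizer: one id per distinct word."""
    vocab = {}
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


input_ids = toy_tokenize("SELECT class FROM table_name_12")
labels = input_ids.copy()  # same trick as result["labels"] above

# Each training pair is (tokens seen so far, next token to predict).
pairs = [(input_ids[: i + 1], labels[i + 1]) for i in range(len(input_ids) - 1)]
for context, target in pairs:
    print(context, "->", target)
```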


And convert each data_point into a prompt that I found online that works quite well:
def generate_and_tokenize_prompt(data_point):
    full_prompt = f"""You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.

You must output the SQL query that answers the question.

### Input:
{data_point["question"]}

### Context:
{data_point["context"]}

### Response:
{data_point["answer"]}
"""
    return tokenize(full_prompt)


Reformat to prompt and tokenize each sample into our tokenized train and eval datasets:
Code:
tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)


7. Setup Lora​

Setup standard Lora config and attach it to the base model:
Code:
model.train() # put model back into training mode
model = prepare_model_for_int8_training(model)

config = LoraConfig(
r=16,
lora_alpha=16,
target_modules=[
"q_proj",
"k_proj",
"v_proj",
"o_proj",
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
To resume from a checkpoint, set resume_from_checkpoint to the path of the adapter_model.bin you want to resume from:
Code:
resume_from_checkpoint = ""  # set this to the adapter_model.bin file you want to resume from

if resume_from_checkpoint:
    if os.path.exists(resume_from_checkpoint):
        print(f"Restarting from {resume_from_checkpoint}")
        adapters_weights = torch.load(resume_from_checkpoint)
        set_peft_model_state_dict(model, adapters_weights)
    else:
        print(f"Checkpoint {resume_from_checkpoint} not found")
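To get a feel for how small the adapter is, the number of trainable parameters implied by the LoraConfig above can be estimated by hand. The hidden size and layer count below are my assumptions for a 7B Llama-style model (check the model's config.json), and the estimate assumes all four target projections are square.

```python
# Assumed architecture values for a 7B Llama-style model; verify them
# against the model's config.json rather than trusting these numbers.
hidden_size = 4096
num_layers = 32
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
r = 16  # same rank as in the LoraConfig above

# LoRA adds two small matrices per adapted layer:
# A with shape (r, d_in) and B with shape (d_out, r).
params_per_module = r * (hidden_size + hidden_size)
trainable = params_per_module * len(target_modules) * num_layers

print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of a 7B model: {trainable / 7e9:.2%}")
```

Under those assumptions only about 17 million parameters are trained, a fraction of a percent of the frozen base model.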
 

Optional stuff to set up Weights and Biases to view training graphs:
Code:
wandb_project = "sql-try2-coder"
if len(wandb_project) > 0:
    os.environ["WANDB_PROJECT"] = wandb_project

if torch.cuda.device_count() > 1:
    # keeps Trainer from trying its own DataParallelism when more than 1 gpu is available
    model.is_parallelizable = True
    model.model_parallel = True

8. Training​

All the variables are standard stuff that I wouldn't recommend messing with:
batch_size = 128
per_device_train_batch_size = 32
gradient_accumulation_steps = batch_size // per_device_train_batch_size
output_dir = "sql-code-llama"

training_args = TrainingArguments(
per_device_train_batch_size=per_device_train_batch_size,
gradient_accumulation_steps=gradient_accumulation_steps,
warmup_steps=100,
max_steps=400,
learning_rate=3e-4,
fp16=True,
logging_steps=10,
optim="adamw_torch",
evaluation_strategy="steps", # if val_set_size > 0 else "no",
save_strategy="steps",
eval_steps=20,
save_steps=20,
output_dir=output_dir,
load_best_model_at_end=False,
group_by_length=True, # group sequences of roughly the same length together to speed up training
report_to="wandb", # if use_wandb else "none",
run_name=f"codellama-{datetime.now().strftime('%Y-%m-%d-%H-%M')}", # if use_wandb else None,
)

trainer = Trainer(
model=model,
train_dataset=tokenized_train_dataset,
eval_dataset=tokenized_val_dataset,
args=training_args,
data_collator=DataCollatorForSeq2Seq(
tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
),
)

(If you run out of GPU memory, change per_device_train_batch_size. The gradient_accumulation_steps variable should ensure this doesn't affect batch dynamics during the training run.)
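The arithmetic behind that note can be sketched as follows, using the values from the training cell above and assuming a single GPU. The variable `smaller` is just an example of a reduced per-device batch size.

```python
batch_size = 128
per_device_train_batch_size = 32
max_steps = 400

gradient_accumulation_steps = batch_size // per_device_train_batch_size
effective_batch = per_device_train_batch_size * gradient_accumulation_steps
examples_seen = effective_batch * max_steps

print(gradient_accumulation_steps, effective_batch, examples_seen)

# Halving the per-device batch doubles the accumulation steps,
# leaving the effective batch size unchanged (single-GPU assumption).
smaller = 16
assert smaller * (batch_size // smaller) == batch_size
```

So each optimizer step still sees 128 examples, and the 400-step run processes 51,200 examples in total regardless of how the batch is split.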
Then we do some pytorch-related optimisation which just makes training faster but doesn't affect accuracy:
model.config.use_cache = False

old_state_dict = model.state_dict
model.state_dict = (lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())).__get__(
model, type(model)
)
if torch.__version__ >= "2" and sys.platform != "win32":
    print("compiling the model")
    model = torch.compile(model)

trainer.train()

This ^ will run for about 1 hour on an A100.

Load the final checkpoint​

Now for the moment of truth! Has our work paid off...?
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer

base_model = "codellama/CodeLlama-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
base_model,
load_in_8bit=True,
torch_dtype=torch.float16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

To load a fine-tuned Lora/Qlora adapter use PeftModel.from_pretrained. output_dir should be something containing an adapter_config.json and adapter_model.bin:
Code:
from peft import PeftModel
model = PeftModel.from_pretrained(model, output_dir)


Try the same prompt as before:
eval_prompt = """You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.

You must output the SQL query that answers the question.

### Input:
Which Class has a Frequency MHz larger than 91.5, and a City of license of hyannis, nebraska?

### Context:
CREATE TABLE table_name_12 (class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)

### Response:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))



And the model outputs:
SELECT class FROM table_name_12 WHERE frequency_mhz > 91.5 AND city_of_license = "hyannis, nebraska"

So it works! We've fine-tuned a model and it actually improves... If you have any questions, shoot me an email at sam[at]ragntune.com.
 


About​

Powered by AI Playgrounds by E2B. Code interpreter on steroids for ChatGPT. Run any language, any terminal process, use filesystem freely. All with access to the internet.

Code Interpreter on steroids for ChatGPT (by e2b)​


This plugin is powered by E2B's AI Playgrounds


E2B plugin for ChatGPT is like code interpreter on steroids.

We give your ChatGPT instance access to a full cloud environment that's sandboxed. That means:

  • Access to Linux OS
  • Install programs
  • Using filesystem (create, list, and delete files and dirs)
  • Run processes
  • Sandboxed - you can run any code
  • Access to the internet
These cloud instances are meant to be used for agents, like sandboxed playgrounds where the agent can do whatever it wants.

👉 This plugin is powered by the E2B API. If you'd like early access, join our Discord and send us a message! 👈

❓ What can I do with this plugin?​

This plugin exposes 3 simple commands (see the OpenAPI file):

  • RunCommand
    • Runs any shell command
  • ReadFile
    • Reads file on path
  • WriteFile
    • Writes content to a file on a path
These simple 3 primitives enable a whole set of possibilities.
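To illustrate why three primitives go a long way, here is a local sketch of the same idea using only the Python standard library. These functions are hypothetical stand-ins that run on your own machine with no sandboxing; they are not the E2B API and the names are mine.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_command(command):
    """Run a command (list of arguments) and return its stdout as text."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


def read_file(path):
    """Read a text file and return its content."""
    return Path(path).read_text(encoding="utf-8")


def write_file(path, content):
    """Write text content to a file, creating or replacing it."""
    Path(path).write_text(content, encoding="utf-8")


# Write a tiny script, read it back, then run it.
workdir = Path(tempfile.mkdtemp())
script = workdir / "hello.py"
write_file(script, 'print("hello from the sandbox")\n')
print(read_file(script))
print(run_command([sys.executable, str(script)]))
```

Writing a file and then running it is already enough to execute code in any installed language, which is what the ideas in the list that follows build on.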

Here are a few ideas for what you can do with these commands:

  • Run any language, not just Python. Currently supported out of the box:
    • Nodejs
    • Go
    • Bash
    • Rust
    • Python3
    • PHP
    • Java
    • Perl
    • .NET
      Please open an issue if you want us to support another language
  • Install headless chrome (go wild!)
  • Run databases
  • Start servers
  • Run terminal commands
  • Create long running processes
  • Deploy websites
  • Install programs via terminal
 





All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows


Project description​


All-in-one embeddings database
txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.

Embeddings databases are a union of vector indexes (sparse and dense), graph networks and relational databases. This enables vector search with SQL, topic modeling, retrieval augmented generation and more.
Embeddings databases can stand on their own and/or serve as a powerful knowledge source for large language model (LLM) prompts.
Summary of txtai features:
  • 🔎 Vector search with SQL, object storage, topic modeling, graph analysis and multimodal indexing
  • 📄 Create embeddings for text, documents, audio, images and video
  • 💡 Pipelines powered by language models that run LLM prompts, question-answering, labeling, transcription, translation, summarizations and more
  • ↪️️ Workflows to join pipelines together and aggregate business logic. txtai processes can be simple microservices or multi-model workflows.
  • ⚙️ Build with Python or YAML. API bindings available for JavaScript, Java, Rust and Go.
  • ☁️ Cloud-native architecture that scales out with container orchestration systems (e.g. Kubernetes)
txtai is built with Python 3.8+, Hugging Face Transformers, Sentence Transformers and FastAPI. txtai is open-source under an Apache 2.0 license.

Why txtai?​

why





Semantic search​

1*p_gInlWA9OTHoet3LKr2wA.png

Embeddings databases are the engine that delivers semantic search. Data is transformed into embeddings vectors where similar concepts will produce similar vectors. Indexes both large and small are built with these vectors. The indexes are used to find results that have the same meaning, not necessarily the same keywords.
The basic use case for an embeddings database is building an approximate nearest neighbor (ANN) index for semantic search. The following example indexes a small number of text entries to demonstrate the value of semantic search.
from txtai import Embeddings

# Works with a list, dataset or generator
data = [
"US tops 5 million confirmed virus cases",
"Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
"Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
"The National Park Service warns against sacrificing slower friends in a bear attack",
"Maine man wins $1M from $25 lottery ticket",
"Make huge profits without work, earn up to $100,000 a day"
]

# Create an embeddings
embeddings = Embeddings(path="sentence-transformers/nli-mpnet-base-v2")

# Create an index for the list of text
embeddings.index(data)

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

# Run an embeddings search for each query
for query in ("feel good story", "climate change",
              "public health story", "war", "wildlife", "asia",
              "lucky", "dishonest junk"):
    # Extract uid of first result
    # search result format: (uid, score)
    uid = embeddings.search(query, 1)[0][0]

    # Print text
    print("%-20s %s" % (query, data[uid]))

The example above shows that for all of the queries, the query text isn’t in the data. This is the true power of transformers models over token based search. What you get out of the box is 🔥🔥🔥!
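To make "similar concepts produce similar vectors" concrete, here is a minimal sketch of a brute-force nearest-neighbor search using cosine similarity. The three-dimensional vectors and the `search` helper are made up for illustration; a real model emits vectors with hundreds of dimensions, and this is not how txtai implements its ANN index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical 3-dimensional "embeddings" standing in for real model output
vectors = {
    "ice shelf collapses": [0.9, 0.1, 0.0],
    "lottery winner": [0.0, 0.2, 0.9],
    "bear attack warning": [0.1, 0.9, 0.1],
}

def search(query_vector, limit=1):
    """Brute-force search: score every entry, return the closest ones."""
    scored = [(text, cosine(query_vector, v)) for text, v in vectors.items()]
    return sorted(scored, key=lambda r: r[1], reverse=True)[:limit]

# A made-up vector standing in for the query "climate change"
best = search([0.8, 0.2, 0.1])[0][0]
print(best)  # ice shelf collapses
```

The query shares no keywords with the stored text; it wins only because its vector points in nearly the same direction.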

Updates and deletes​

Updates and deletes are supported for embeddings. The upsert operation will insert new data and update existing data.
The following section runs a query, then updates a value changing the top result and finally deletes the updated value to revert back to the original query results.
# Run initial query
uid = embeddings.search("feel good story", 1)[0][0]
print("Initial: ", data[uid])

# Create a copy of data to modify
udata = data.copy()

# Update data
udata[0] = "See it: baby panda born"
embeddings.upsert([(0, udata[0], None)])

uid = embeddings.search("feel good story", 1)[0][0]
print("After update: ", udata[uid])

# Remove record just added from index
embeddings.delete([0])

# Ensure value matches previous value
uid = embeddings.search("feel good story", 1)[0][0]
print("After delete: ", udata[uid])
Initial: Maine man wins $1M from $25 lottery ticket
After update: See it: baby panda born
After delete: Maine man wins $1M from $25 lottery ticket

Persistence​

Embeddings can be saved to storage and reloaded.
embeddings.save("index")

embeddings = Embeddings()
embeddings.load("index")

uid = embeddings.search("climate change", 1)[0][0]
print(data[uid])
Canada's last fully intact ice shelf has suddenly collapsed, forming a
Manhattan-sized iceberg

Hybrid search​

While dense vector indexes are by far the best option for semantic search systems, sparse keyword indexes can still add value. There may be cases where finding an exact match is important.
Hybrid search combines the results from sparse and dense vector indexes for the best of both worlds.
# Create an embeddings
embeddings = Embeddings(
    hybrid=True,
    path="sentence-transformers/nli-mpnet-base-v2"
)

# Create an index for the list of text
embeddings.index(data)

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

# Run an embeddings search for each query
for query in ("feel good story", "climate change",
              "public health story", "war", "wildlife", "asia",
              "lucky", "dishonest junk"):
    # Extract uid of first result
    # search result format: (uid, score)
    uid = embeddings.search(query, 1)[0][0]

    # Print text
    print("%-20s %s" % (query, data[uid]))
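One common way to combine sparse and dense results is to normalize each score set and take a weighted average. The sketch below illustrates that general idea only; the `normalize` and `hybrid` helpers, the score values and the 0.5 weight are assumptions for illustration, not a description of txtai's internals.

```python
def normalize(scores):
    """Scale a {uid: score} mapping into the 0..1 range (min-max)."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {uid: 1.0 for uid in scores}
    return {uid: (s - low) / (high - low) for uid, s in scores.items()}

def hybrid(dense, sparse, weight=0.5):
    """Blend dense and sparse scores; weight is the share given to dense."""
    dense, sparse = normalize(dense), normalize(sparse)
    ids = set(dense) | set(sparse)
    combined = {
        uid: weight * dense.get(uid, 0.0) + (1 - weight) * sparse.get(uid, 0.0)
        for uid in ids
    }
    return sorted(combined.items(), key=lambda r: r[1], reverse=True)

# Made-up scores: record 1 is the best semantic match,
# record 2 is close behind and also the best keyword match
dense_scores = {0: 0.30, 1: 0.80, 2: 0.75}
sparse_scores = {0: 1.0, 2: 9.0}
ranking = hybrid(dense_scores, sparse_scores)
print([uid for uid, _ in ranking])  # [2, 1, 0]
```

Record 2 overtakes record 1 because it scores well on both indexes, which is the "best of both worlds" effect described above.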

I replaced 50 lines of code with a single LLM prompt​

Midjourney prompt: blocks of code matching the color pallet of (haihai logo)
I recently needed to write some code to compare two mailing addresses. Seems easy enough. For instance, it's pretty easy to figure out that these two represent different locations:

"123 Main St, Brooklyn, NY 11217"
"29 St Marks Place, Brooklyn, NY 11217"

But what about these two?

"123 Main St, Brooklyn, NY 11217"
"123 Main Street, Brooklyn, NY 11217"

Or these:

"123 Main St, Brooklyn, NY 11217"
"123 Main St. Brooklyn, NY 11217"
"123 MAIN ST. BROOKLYN, NY 11217"
"123 Main St, Brooklyn, NY 11217-3437"

The edge cases were endless. I spent an entire afternoon writing string-matching heuristics, regular expressions, and even implementing Levenshtein distance to answer the simple question, "is this address the same as that one?"

But then, on a lark, I replaced all that code – 50+ lines in all – with a single call to GPT. And within ten minutes and just a few lines of code, I hit 100% accuracy against my test suite!

That experience raised some super interesting questions:

  • Where can you use LLMs to solve day-to-day programming problems faster, with fewer lines of code?
  • Which problems are best suited for LLM replacement?
  • What are the cost and performance implications of LLM replacement?
  • Can we keep a high bar of operational excellence in production?
This post will dive into these topics and lots more.


Tinkering with LLMs​

Like many software developers, I’ve spent 2023 tinkering with LLMs, using every new AI coding tool that comes out, and racing to add generative AI experiences into our products.

But only recently have I started using LLMs to solve day-to-day programming problems, replacing more traditional algorithms, data structures, and heuristics with API calls to language models.

When talking to folks about how they are using LLMs – which I do a lot! – the conversation generally fits into 4 categories:

  1. Standalone LLM interfaces via web/mobile interfaces (e.g. ChatGPT, Midjourney)
  2. AI features integrated into tools (e.g. GitHub Copilot, Adobe’s Generative Fill, Notion’s “Start writing with AI…”)
  3. AI specific products (Twilio’s CustomerAI, Intercom’s Fin support bot)
  4. Analysis and back office tools (e.g. analyze our company’s PDF with Langchain)
In each case, AI functionality is directly exposed to the user and it’s part of the product or feature. But I’ve had surprisingly few conversations about using AI in code. As software developers, LLMs give us a fascinating new tool in our tool belts to solve day-to-day programming problems (even if our company hasn’t yet changed its name from .com to .ai 🧌).

Having started using LLMs to solve traditional programming problems recently, I was surprised at how different a way of thinking it is, and how many of our legacy tools and best practices aren’t working well in this new paradigm.

As such, I’ll be documenting my experiences and learnings as we go. This is the first post in a series that will discuss how we’ve successfully (🤞) used prompts instead of writing business logic in code, how to identify good candidates for such an approach, how to operationalize and maintain such code, limitations and risks, and more.

A Real World Example: Matching Addresses​

Rather than rambling on abstractly, let’s jump in with the real world coding example that inspired this post.

Quick context: My company, QuitCarbon, helps homeowners transition off fossil fuel appliances. As part of this process, we analyze how much natural gas (methane) a family uses, which we can determine from their utility bill. But when importing utility bills, we want to make sure that the service address on their utility bill matches the address of their property. We wouldn’t want to import data from one property into another by accident; multi-tenancy bugs are bad news bears.

Let’s consider a home at “123 Main St, Brooklyn, NY 11217”. We want to be sure that the service address on the bill matches. So in our code, we want to call a function like this:

if property.address.matches?(service_address)
  # continue importing data
That should be a trivial string comparison:

def matches?(service_address)
  self.to_s == service_address
end
Hahaha this failed on literally the very first utility bill I checked! It had the address written as “123 MAIN ST, BROOKLYN, NY 11217”. No problem:

def matches?(service_address)
  self.to_s.downcase == service_address.downcase
end
That test passed! Next up, “123 Main Street, Brooklyn, NY 11217”. Notice the “St” vs “Street.” OK so we just need to replace "street" with "st" and repeat for all common abbreviations like Ct, Rd, St, and Ave. Maybe something like:

def matches?(service_address)
  self.to_s.downcase.gsub('street', 'st') == service_address.downcase.gsub('street', 'st')
end
Probably also need to do it for Ter, Cir, Way, Pl, Blvd. Are there others? And annoyingly, what do we do with “123 St Marks Pl”? More heuristics! Zip vs Zip+4? More heuristics!! Brooklyn vs New York? MORE HEURISTICS!!!
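For a sense of where this road leads, here is a minimal sketch of the heuristic approach, written in Python rather than the post's Ruby. The `SUFFIXES` table and the `normalize` and `matches` helpers are hypothetical and deliberately incomplete; this is not the code the author actually wrote.

```python
import re

# Hypothetical, deliberately incomplete abbreviation table
SUFFIXES = {"street": "st", "avenue": "ave", "road": "rd", "court": "ct",
            "place": "pl", "boulevard": "blvd"}

def normalize(address):
    """Lowercase, drop the ZIP+4 extension, strip punctuation, abbreviate suffixes."""
    text = address.lower()
    text = re.sub(r"(\d{5})-\d{4}\b", r"\1", text)  # ZIP+4 -> ZIP
    text = re.sub(r"[.,]", " ", text)               # punctuation -> space
    words = [SUFFIXES.get(word, word) for word in text.split()]
    return " ".join(words)

def matches(a, b):
    """True when both addresses normalize to the same string."""
    return normalize(a) == normalize(b)

base = "123 Main St, Brooklyn, NY 11217"
print(matches(base, "123 Main Street, Brooklyn, NY 11217"))   # True
print(matches(base, "123 MAIN ST. BROOKLYN, NY 11217"))       # True
print(matches(base, "123 Main St, Brooklyn, NY 11217-3437"))  # True
print(matches(base, "123 Main St, New York, NY 11217"))       # False
```

The last case is the catch: every new variation (Brooklyn vs New York, missing ZIP codes, unit numbers) needs another rule, and the table never feels finished.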
 

OK this is getting tedious and I'll never get it all. My next attempt was to switch to fuzzier string matching via Levenshtein distance. We can compare how closely the address strings match. And if they’re only a few edits apart, we can assume they match:

> Levenshtein('123 Main St, Brooklyn, NY 11217', '29 St Marks Place, Brooklyn, NY 11217') => 13 # way far off

> Levenshtein('123 Main St, Brooklyn, NY 11217', '123 Main Street, Brooklyn, NY 11217') => 4 # quite close
Great! Small differences, say, less than 10 (hella arbitrary), mean that there’s a very good chance they match.

> Levenshtein('123 Main St, Brooklyn, NY 11217', '124 Main St, Brooklyn, NY 11217') => 1 # ruh roh
Damn! These are so clearly different properties even though they only differ by a single character! (yes, this was a real world example – we had a homeowner with multiple properties next to one another).
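For reference, Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. Below is a minimal Python sketch of the standard dynamic-programming version (the post itself is in Ruby and does not show its implementation, so the function name here is an assumption).

```python
def levenshtein(a, b):
    """Edit distance: minimum single-character inserts, deletes, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]

base = "123 Main St, Brooklyn, NY 11217"
print(levenshtein(base, "123 Main Street, Brooklyn, NY 11217"))  # 4
print(levenshtein(base, "124 Main St, Brooklyn, NY 11217"))      # 1
```

The metric treats every character equally, so it cannot know that the "reet" in "Street" is noise while the "4" in "124" changes the property.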

Alright, so it’s now been a couple mind-numbing hours of adding more test cases and more and more heuristics for each edge case. I can keep adding if-statements and keep increasing our test suite, and yeah, I could probably get a pretty darn reliable function written after another hour or two. But, there’s gotta be…

A Better Way!​

It’s 2023 after all. So on a lark, I deleted all my code and rewrote it like this:

def matches?(service_address)
  prompt = "I will give you two addresses, potentially in different formats. "
  prompt << "You will reply 'Yes' if there is a good chance they represent the same property or 'No' if they likely do not. "
  prompt << "You will not provide any explanations. "
  prompt << "Here are the two properties I want you to compare: #{self.to_s} and #{service_address}"
  OpenAI.chat(prompt).response =~ /Yes/
end
Whoa! I simply told the LLM the logic I wanted. It took fewer than 5 minutes to write and got 90%+ accuracy against our test suite on the first try!
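The same idea can be sketched in Python with the model call injected as a plain function, which keeps the example runnable offline. The `addresses_match` and `fake_model` names are hypothetical, and `fake_model` is a stand-in for a real API client rather than any vendor's SDK. Unlike the regex check above, this version rejects any reply that is not a clear Yes or No.

```python
from typing import Callable

PROMPT = (
    "I will give you two addresses, potentially in different formats. "
    "You will reply 'Yes' if there is a good chance they represent the same "
    "property or 'No' if they likely do not. "
    "You will not provide any explanations. "
    "Here are the two properties I want you to compare: {a} and {b}"
)

def addresses_match(a: str, b: str, complete: Callable[[str], str]) -> bool:
    """Ask the model, then accept only an unambiguous Yes or No."""
    reply = complete(PROMPT.format(a=a, b=b)).strip().strip(".").lower()
    if reply == "yes":
        return True
    if reply == "no":
        return False
    raise ValueError(f"Unexpected model reply: {reply!r}")

def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM client so the sketch runs without network access."""
    return "Yes" if "123 Main Street" in prompt else "No"

print(addresses_match("123 Main St", "123 Main Street", fake_model))  # True
print(addresses_match("123 Main St", "124 Main St", fake_model))      # False
```

Passing the client in as a parameter also makes it easy to run a test suite against canned replies before spending money on real calls.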

Can we futz around^H^H^H prompt engineer a little bit and improve this? Here was my 2nd attempt:

# Formatting and bullets provided for readability for you, mere mortal

I will give you two addresses, potentially in different formats.

You will reply 'Yes' if there is a good chance they represent the same property or 'No' if they likely do not. You will not provide any explanations.

Code:
* For example, given "123 Main St" and "123 Main Street" you would reply "Yes" because one is an abbreviation.
* Given "123 Main St, Brooklyn, NY 11217" and "123 Main St" you would reply "Yes" because one just has more specificity but they are likely the same.
* Given "123 MAIN ST" and "123 Main St" you would reply "Yes" because it's just a case difference.
* Given "123 Main St" and "124 Main St" you would reply "No" because they are different street numbers.
* Given "123 Main St, Brooklyn, NY 11217" and "123 Main St, Baltimore MD 21208" you would reply "No" because they are different cities.
* Given "123 Main St, Brooklyn, NY 11217" and "123 Main St, New York, NY 11217" you would reply "yes" because they have the same street and zip code and Brooklyn is either the same or next to New York.

Here are the two properties I want you to compare: #{self.to_s} and #{service_address}
And BOOM! 100%(!) accuracy against our test suite with just 2 prompt tries and under 10 minutes of programming time!

Third Time’s the Charm

OK so I’m super happy with the accuracy and almost ready to ship it, but the battle-scarred graybeard in me is a little worried that when we start seeing edge cases in production, we won’t be able to determine why we got a given result, which will make future prompt improvements hard.

So we made one small change to the Reply ‘Yes’ or ‘No’ part to be:

Code:
You will respond with JSON and only JSON without extra prose.

One key will be 'result' and the possible values are Yes and No.

The other key will be 'reason' and include a free text explanation of why you chose Yes or No.

And check out the final result for “123 Main Street” vs “123 MAIN ST”

{
  "result": "Yes",
  "reason": "The addresses are very similar, with only minor differences in formatting (lowercase vs uppercase). The street name 'Main Street' is abbreviated as 'MAIN ST' in the second address. However, the numbers '123' are the same in both addresses, indicating that they likely refer to the same property."
}
Wawaweewah! We can now add a simple

log.debug response['reason']

for future debugging, and our code can check

response['result'] =~ /Yes/
And voila we’re done! A fraction of the engineering time AND more accurate results. You got your chocolate in my peanut butter.
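Parsing the JSON reply deserves a little defensive code, since a model can still return prose or an unexpected value. Here is a minimal Python sketch (the post's own code is Ruby); the `parse_verdict` helper and the logger name are hypothetical.

```python
import json
import logging

log = logging.getLogger("address_match")

def parse_verdict(raw: str) -> bool:
    """Parse {"result": "Yes"|"No", "reason": "..."} and log the reason."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as error:
        raise ValueError(f"Model did not return JSON: {raw!r}") from error
    if not isinstance(payload, dict):
        raise ValueError(f"Expected a JSON object, got: {raw!r}")
    result = str(payload.get("result", "")).strip().lower()
    if result not in ("yes", "no"):
        raise ValueError(f"Unexpected result value: {payload.get('result')!r}")
    # Keep the explanation around for debugging future edge cases
    log.debug("match reason: %s", payload.get("reason", ""))
    return result == "yes"

print(parse_verdict('{"result": "Yes", "reason": "same street number"}'))  # True
```

Raising on anything unexpected means a malformed reply fails loudly instead of silently counting as a non-match.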

Technical details and considerations​

Determining viable problem candidates for LLMs​

What made this a good candidate for an LLM? First, there was lots of string manipulation. Whenever I find myself doing lots of string interpolation, substitution, and regexing, that’s a good flag to think LLM. Next, I was layering on edge cases and heuristics and if statements; there wasn’t a clear algorithm or science behind my work. That’s likely another good flag that AI could be a good solution.

Using Prompts for Flow Control​

Because we’re using this for flow control (if-then-else block), we want deterministic response formats. GPT is quite good at respecting directives like “Only reply Yes or No” and “You will not provide any explanations.” Those are critical to avoid responses like “123 Main St and 321 Main St are not the same”. Once I was confident in the “Yes” or “No” response, it was easy to just check for that string.

To get more determinism, we also want to set the temperature on the API request. Temperature is a value between 0 and 2 that controls the level of randomness and creativity in the response. Since we want neither creativity nor randomness in our responses, we set the temperature as low as possible, which helped.
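As a rough illustration, a chat-completion request body with the temperature turned all the way down might be built like this. The `model`, `temperature` and `messages` fields follow the OpenAI Chat Completions request format, where 0 is the lowest accepted temperature; the `build_request` helper itself is hypothetical, and the post does not show the author's actual request code.

```python
def build_request(prompt: str, model: str = "gpt-3.5-turbo") -> dict:
    """Request body for a chat completion with randomness turned down."""
    return {
        "model": model,
        "temperature": 0,  # lowest setting: favor the most likely tokens
        "messages": [{"role": "user", "content": prompt}],
    }

body = build_request("Compare: 123 Main St and 123 Main Street")
print(body["temperature"])  # 0
```

Even at the lowest temperature, responses tend to be very stable rather than strictly guaranteed identical, so the calling code should still validate the reply format.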

Model Selection​

We ran our test suite against both gpt-4 and gpt-3.5-turbo and found no differences, so we chose the latter for better speed and lower cost.

Performance​

This function is called in an asynchronous background job that runs infrequently, so the performance implications – potentially up to a second or more per call – aren’t concerning.

Cost​

Since it’s called only once per customer, the costs are fairly trivial. Although a more interesting perspective is this: if the LLM approach saved just 1 hour of Engineering time, we could process over 100,000(!) utility bills and it would still be cheaper.
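The break-even arithmetic behind that comparison is simple to sketch. Every figure below is a hypothetical placeholder chosen for illustration (hourly engineering cost, tokens per call, price per 1,000 tokens); none of them come from the post or from a vendor's price list.

```python
def break_even_calls(hourly_cost: float, tokens_per_call: int,
                     price_per_1k_tokens: float) -> float:
    """How many LLM calls cost as much as one hour of engineering time."""
    cost_per_call = tokens_per_call / 1000 * price_per_1k_tokens
    return hourly_cost / cost_per_call

# Assumed values: $100/hour, 500 tokens per call, $0.002 per 1K tokens
calls = break_even_calls(hourly_cost=100.0, tokens_per_call=500,
                         price_per_1k_tokens=0.002)
print(f"break-even at roughly {calls:,.0f} calls")
```

Plugging in real rates and measured token counts gives the actual threshold for a given team.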

Conclusion​

The takeaway here is that the product or feature or customer experience in question has nothing to do with AI. It’s a different way to approach traditional programming problems. Just like we can solve a problem differently by using an array or a hash, by swapping out our sort algorithm, or by moving business logic from application code to SQL, LLMs give us a new tool in our tool belts to approach software problems.

Up Next​

Writing this simple function and deploying it to production - while keeping a high bar of operational excellence – raised a TON more questions, many of which will be discussed in future posts:

  • Identifying software problems that are good candidates for AI
  • Crafting prompts for application logic and flow control, not for creativity
  • Operational excellence: architecting for reliability
  • Change management for prompts
  • Monitoring response accuracy and quality
  • Tracking costs, token counts, and optimizations
  • How to do test driven development with LLMs
  • Performance implications of using LLMs in production
  • Multisourcing vendors and LLM redundancy

 

Asking 60+ LLMs a set of 20 questions​


Benchmarks like HellaSwag are a bit too abstract for me to get a sense of how well they perform in real-world workflows.


I had the idea of writing a script that runs prompts testing basic reasoning, instruction following, and creativity against around 60 models that I could get my hands on through inference APIs.


The script stored all the answers in a SQLite database, and those are the raw results.
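A script of that shape can be sketched in a few lines of Python with the standard-library `sqlite3` module. The table layout, the `run_benchmark` helper and the stub "models" below are assumptions for illustration; the original script and its schema are not shown in the post.

```python
import sqlite3
from typing import Callable, Dict, List

def run_benchmark(models: Dict[str, Callable[[str], str]],
                  prompts: List[str],
                  db: sqlite3.Connection) -> None:
    """Ask every model every prompt and store each raw answer."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS answers ("
        "model TEXT, prompt TEXT, answer TEXT, "
        "PRIMARY KEY (model, prompt))"
    )
    for name, ask in models.items():
        for prompt in prompts:
            try:
                answer = ask(prompt)
            except Exception as error:  # keep going if one provider fails
                answer = f"ERROR: {error}"
            db.execute(
                "INSERT OR REPLACE INTO answers VALUES (?, ?, ?)",
                (name, prompt, answer),
            )
    db.commit()

# Stub "models" so the sketch runs without network access
models = {
    "echo": lambda prompt: prompt,
    "shouty": lambda prompt: prompt.upper(),
}
prompts = ["Sally (a girl) has 3 brothers. Each brother has 2 sisters. "
           "How many sisters does Sally have?"]

db = sqlite3.connect(":memory:")
run_benchmark(models, prompts, db)
rows = db.execute("SELECT model, answer FROM answers ORDER BY model").fetchall()
print(len(rows))  # 2
```

Storing raw answers keyed by model and prompt makes it possible to re-run a single model later without duplicating rows.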


view: all prompts / all models


reflexion:
  • Argue for and against the use of kubernetes in the style of a haiku.
  • Give two concise bullet-point arguments against the Münchhausen trilemma (don't explain what it is)
  • I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. I also gave 3 bananas to my brother. How many apples did I remain with?
    Let's think step by step.
  • Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?
  • Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have? Let's think step by step.
knowledge:
  • Explain in a short paragraph quantum field theory to a high-school student.
  • Is Taiwan an independent country?
  • Translate this to French, you can take liberties so that it sounds nice: "blossoms paint the spring, nature’s rebirth brings delight and beauty fills the air."
code:
 


US Copyright Office denies protection for another AI-created image​

By Blake Brittain

September 6, 2023, 6:20 PM EDT (updated 5 days ago)

AI (Artificial Intelligence) letters are placed on a computer motherboard in this illustration taken June 23, 2023. REUTERS/Dado Ruvic/Illustration
Summary
  • Midjourney-created "space opera" artwork not protectable, office says
  • Office has previously rejected copyrights for AI-generated work

Sept 6 (Reuters) - The U.S. Copyright Office has again rejected copyright protection for art created using artificial intelligence, denying a request by artist Jason M. Allen for a copyright covering an award-winning image he created with the generative AI system Midjourney.

The office said on Tuesday that Allen's science-fiction themed image "Theatre D'opera Spatial" was not entitled to copyright protection because it was not the product of human authorship.

The Copyright Office in February rescinded copyrights for images that artist Kris Kashtanova created using Midjourney for a graphic novel called "Zarya of the Dawn," dismissing the argument that the images showed Kashtanova's own creative expression. It has also rejected a copyright for an image that computer scientist Stephen Thaler said his AI system created autonomously.

Allen said on Wednesday that the office's decision on his work was expected, but he was "certain we will win in the end."

"If this stands, it is going to create more problems than it solves," Allen said. "This is going to create new and creative problems for the copyright office in ways we can't even speculate yet."

Representatives for Midjourney did not immediately respond to a request for comment on the decision on Wednesday.

Allen applied last September to register a copyright in "Theatre D'opera Spatial," an image evoking a futuristic royal court that won the Colorado State Fair's art competition in 2022. A Copyright Office examiner requested more information about Midjourney's role in creating the image, which had received national attention as the first AI-generated work to win the contest.


Allen told the office that he "input numerous revisions and text prompts at least 624 times to arrive at the initial version of the image" using Midjourney and altered it with Adobe Photoshop.

The office asked Allen to disclaim the parts of the image that Midjourney generated in order to receive copyright protection. It rejected Allen's application after he declined.

The office's Copyright Review Board affirmed the decision on Tuesday, finding the image as a whole was not copyrightable because it contained more than a minimal amount of AI-created material.


The office also rejected Allen's argument that denying copyrights for AI-created material leaves a "void of ownership troubling to creators."

Read more:
Humans vs. machines: the fight to copyright AI art
AI-created images lose U.S. copyrights in test for new technology
AI-generated art cannot receive copyrights, US court says

Reporting by Blake Brittain in Washington



Theatre D'opera Spatial