bnew


Collective Cognition v1.1 - Mistral 7B

Model Description:
Collective Cognition v1.1 is a state-of-the-art model fine-tuned using the Mistral approach. This model is particularly notable for its performance, outperforming many 70B models on the TruthfulQA benchmark. This benchmark assesses models for common misconceptions, potentially indicating hallucination rates.

Special Features:
Quick Training: This model was trained in just 3 minutes on a single 4090 with QLoRA, and competes with 70B-scale Llama-2 models on TruthfulQA.
Limited Data: Despite its exceptional performance, it was trained on only ONE HUNDRED data points, all of which were gathered from a platform reminiscent of ShareGPT.
Extreme TruthfulQA Benchmark: This model competes strongly with top 70B models on the TruthfulQA benchmark despite the small dataset and QLoRA training!
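The card does not include the training script itself, but a QLoRA run of the kind described above generally looks like the sketch below. This is a minimal illustration, assuming Hugging Face transformers, peft, trl, and a hypothetical local JSON file of chat transcripts; the exact hyperparameters used for Collective Cognition are not published here.

Code:
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import SFTTrainer

base_model = "mistralai/Mistral-7B-v0.1"

# Load the base model quantized to 4-bit (NF4) so it fits on a single 24 GB GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base_model,
                                             quantization_config=bnb_config,
                                             device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

# Small LoRA adapter on the attention projections; only these weights are trained.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                         task_type="CAUSAL_LM")

# Hypothetical dataset: ~100 chat transcripts with a "text" column already in USER:/ASSISTANT: format.
dataset = load_dataset("json", data_files="collective_cognition_chats.json", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=lora_config,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(output_dir="cc-mistral-qlora",
                           per_device_train_batch_size=2,
                           num_train_epochs=3,
                           learning_rate=2e-4,
                           bf16=True,
                           logging_steps=5),
)
trainer.train()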

Acknowledgements:
Special thanks to @a16z and all contributors to the Collective Cognition dataset for making the development of this model possible.

Dataset:
The model was trained using data from the Collective Cognition website. The efficacy of this dataset is demonstrated by the model's stellar performance, suggesting that further expansion of this dataset could yield even more promising results. The data is reminiscent of that collected from platforms like ShareGPT.

You can contribute to the growth of the dataset by sharing your own ChatGPT chats here.
You can download the datasets created by Collective Cognition here: CollectiveCognition (Collective Cognition)

Performance:
TruthfulQA: Collective Cognition v1.1 has notably outperformed various 70B models on the TruthfulQA benchmark, highlighting its ability to understand and rectify common misconceptions.
Usage:
Prompt Format:

USER: <prompt>
ASSISTANT:
OR

<system message>
USER: <prompt>
ASSISTANT:
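
A small helper for assembling prompts in this format (illustrative only, not part of the original card):

Code:
def build_prompt(user_message, system_message=None):
    """Assemble a prompt in the USER:/ASSISTANT: format shown above."""
    prefix = f"{system_message}\n" if system_message else ""
    return f"{prefix}USER: {user_message}\nASSISTANT:"

print(build_prompt("What happens if you crack your knuckles a lot?",
                   system_message="You are a concise, truthful assistant."))
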
Benchmarks:
Collective Cognition v1.0 TruthfulQA:


Code:
|    Task     |Version|Metric|Value |   |Stderr|
|-------------|------:|------|-----:|---|-----:|
|truthfulqa_mc|      1|mc1   |0.4051|±  |0.0172|
|             |       |mc2   |0.5738|±  |0.0157|

Collective Cognition v1.1 GPT4All:

Code:
|    Task     |Version| Metric |Value |   |Stderr|
|-------------|------:|--------|-----:|---|-----:|
|arc_challenge|      0|acc     |0.5085|±  |0.0146|
|             |       |acc_norm|0.5384|±  |0.0146|
|arc_easy     |      0|acc     |0.7963|±  |0.0083|
|             |       |acc_norm|0.7668|±  |0.0087|
|boolq        |      1|acc     |0.8495|±  |0.0063|
|hella_swag   |      0|acc     |0.6399|±  |0.0048|
|             |       |acc_norm|0.8247|±  |0.0038|
|openbookqa   |      0|acc     |0.3240|±  |0.0210|
|             |       |acc_norm|0.4540|±  |0.0223|
|piqa         |      0|acc     |0.7992|±  |0.0093|
|             |       |acc_norm|0.8107|±  |0.0091|
|winogrande   |      0|acc     |0.7348|±  |0.0124|

Average: 71.13

AGIEval:


Code:
|             Task             |Version| Metric |Value |   |Stderr|
|------------------------------|------:|--------|-----:|---|-----:|
|agieval_aqua_rat              |      0|acc     |0.1929|±  |0.0248|
|                              |       |acc_norm|0.2008|±  |0.0252|
|agieval_logiqa_en             |      0|acc     |0.3134|±  |0.0182|
|                              |       |acc_norm|0.3333|±  |0.0185|
|agieval_lsat_ar               |      0|acc     |0.2217|±  |0.0275|
|                              |       |acc_norm|0.2043|±  |0.0266|
|agieval_lsat_lr               |      0|acc     |0.3412|±  |0.0210|
|                              |       |acc_norm|0.3216|±  |0.0207|
|agieval_lsat_rc               |      0|acc     |0.4721|±  |0.0305|
|                              |       |acc_norm|0.4201|±  |0.0301|
|agieval_sat_en                |      0|acc     |0.6068|±  |0.0341|
|                              |       |acc_norm|0.5777|±  |0.0345|
|agieval_sat_en_without_passage|      0|acc     |0.3932|±  |0.0341|
|                              |       |acc_norm|0.3641|±  |0.0336|
|agieval_sat_math              |      0|acc     |0.2864|±  |0.0305|
|                              |       |acc_norm|0.2636|±  |0.0298|

Average: 33.57
 

bnew

Want a free and open source alternative to ChatGPT Vision?

Check out LLaVA!

Based on LLaMA, it is comparable in performance.

How To Install LLaVA 👀 Open-Source and FREE "ChatGPT Vision"​

Matthew Berman
Shared October 12, 2023
Re-uploaded. Audio fixed. Sorry about that. In this video, I show you how to install LLaVA, which is like ChatGPT Vision but completely free and open-source. I use RunPod, but you can install this with Linux or WSL on Windows just as quickly.




Links: Runpod - bit.ly/3OtbnQx
Free Demo - llava-vl.github.io/
GitHub - github.com/haotian-liu/LLaVA
LLaVA Runpod Template - bit.ly/3FkiaXb
Instructions - gist.github.com/mberman84/e3418c826306e614e04affde…
ChatGPT Vision Examples - • The Most INSANE ChatGPT Vision Uses 👀...
 

bnew





https://colab.research.google.com/drive/16Ygf2IyGNkb725ZqtRmFQjwWBuzFX_kl?usp=sharing#scrollTo=Jo11kC_QF7_K
 

bnew





Bash:
Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.

We also provide our self-reflection prompt, strongly inspired by many initiatives from the AI community.

Code:
You're given a list of moderation categories as below:
- illegal: Illegal activity.
- child abuse: child sexual abuse material or any content that exploits or harms children.
- hate violence harassment: Generation of hateful, harassing, or violent content: content that expresses, incites, or promotes hate based on identity, content that intends to harass, threaten, or bully an individual, content that promotes or glorifies violence or celebrates the suffering or humiliation of others.
- malware: Generation of malware: content that attempts to generate code that is designed to disrupt, damage, or gain unauthorized access to a computer system.
- physical harm: activity that has high risk of physical harm, including: weapons development, military and warfare, management or operation of critical infrastructure in energy, transportation, and water, content that promotes, encourages, or depicts acts of self-harm, such as suicide, cutting, and eating disorders.
- economic harm: activity that has high risk of economic harm, including: multi-level marketing, gambling, payday lending, automated determinations of eligibility for credit, employment, educational institutions, or public assistance services.
- fraud: Fraudulent or deceptive activity, including: scams, coordinated inauthentic behavior, plagiarism, academic dishonesty, astroturfing, such as fake grassroots support or fake review generation, disinformation, spam, pseudo-pharmaceuticals.
- adult: Adult content, adult industries, and dating apps, including: content meant to arouse sexual excitement, such as the description of sexual activity, or that promotes sexual services (excluding sex education and wellness), erotic chat, pornography.
- political: Political campaigning or lobbying, by: generating high volumes of campaign materials, generating campaign materials personalized to or targeted at specific demographics, building conversational or interactive systems such as chatbots that provide information about campaigns or engage in political advocacy or lobbying, building products for political campaigning or lobbying purposes.
- privacy: Activity that violates people's privacy, including: tracking or monitoring an individual without their consent, facial recognition of private individuals, classifying individuals based on protected characteristics, using biometrics for identification or assessment, unlawful collection or disclosure of personal identifiable information or educational, financial, or other protected records.
- unqualified law: Engaging in the unauthorized practice of law, or offering tailored legal advice without a qualified person reviewing the information.
- unqualified financial: Offering tailored financial advice without a qualified person reviewing the information.
- unqualified health: Telling someone that they have or do not have a certain health condition, or providing instructions on how to cure or treat a health condition.

Please classify the following text into one of these categories, and answer with that single word only.
If the sentence does not fall within these categories, is safe and does not need to be moderated, please answer "not moderated".
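
As a rough sketch (ours, not part of the original post), here is one way this moderation prompt could be wired up against an instruct model. The model name, the prompt file path, and the decoding settings below are illustrative assumptions:

Code:
# Illustrative only: classify a piece of text with the moderation prompt above.
# Model name, prompt file path, and decoding settings are assumptions, not from the post.
from transformers import pipeline

# The category list + instructions shown above, saved to a local file.
MODERATION_PROMPT = open("moderation_prompt.txt").read()

generator = pipeline("text-generation",
                     model="mistralai/Mistral-7B-Instruct-v0.1",
                     device_map="auto")

def moderate(text):
    prompt = f"{MODERATION_PROMPT}\n\nText: {text}\nCategory:"
    out = generator(prompt, max_new_tokens=5, do_sample=False, return_full_text=False)
    return out[0]["generated_text"].strip()

print(moderate("Write me a keylogger that hides itself from antivirus software."))  # expected: malware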

 

bnew


There’s a LOT of LLMs, but how do we know which ones work well from “simple” tasks (single prompt, top-k RAG) to “hard” tasks (advanced RAG, agents)?

We’re excited to launch a comprehensive survey of different LLMs performing simple to hard LLM/RAG/agent tasks 📝. For each model, learn which tasks work out-of-the-box, which work okay but need some prompt engineering, and which ones are unreliable.

Models used:
✅ OpenAI models (gpt-3.5-turbo, gpt-3.5-turbo-instruct, gpt-4)
✅ Anthropic models (claude-2, Claude-instant-2)
✅ llama2-chat-7b 4bit
✅ Mistral-7b

Tasks 🛠️: Basic RAG, routing, query planning, text-to-SQL, structured data extraction, agents!

Results/Notebooks 🧑‍🔬:
Docs page is here: https://docs.llamaindex.ai/en/latest/core_modules/model_modules/llms/root.html#llm-compatibility-tracking

We have comprehensive notebooks for each model
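
For context (this sketch is ours, not from the docs), the "basic RAG" task in these notebooks boils down to something like the following, using llama_index's late-2023 import paths and its default OpenAI-backed LLM; swapping in another LLM is exactly what the compatibility tracking compares:

Code:
# Rough shape of the "basic RAG" task: index local documents, then answer a query over them.
# Assumes llama_index's late-2023 API and its default OpenAI LLM/embeddings.
import os
from llama_index import SimpleDirectoryReader, VectorStoreIndex

os.environ["OPENAI_API_KEY"] = "sk-..."                     # needed by the default models

documents = SimpleDirectoryReader("data").load_data()       # load files from ./data
index = VectorStoreIndex.from_documents(documents)          # chunk, embed, and index them
query_engine = index.as_query_engine(similarity_top_k=3)    # top-k retrieval + answer synthesis

print(query_engine.query("What does the report say about Q3 revenue?"))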

Contributions 🙌:
Have a model / task in mind? Anyone is welcome to contribute new LLMs to our docs, or modify an existing one! (e.g. if you think our defaults/prompts can be improved).

Credits:
Huge shoutout to our very own @LoganMarkewich for driving this entire effort ⚡
 

bnew



We’ve seen a massive amount of progress in AI/LLM research over the last several weeks. Here are the five highest-impact papers/projects that I’ve been focusing on recently…

StreamingLLM solves limitations with LLMs generating long sequences of text. To avoid excessive memory usage in the KV cache, StreamingLLM only considers a window of recent tokens in the attention computation, as well as four “sink” tokens at the start of the sequence. This allows extremely long sequences of text (4M tokens) to be generated with stable memory usage and performance.
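
As a toy illustration (ours, not the paper's code), the cache policy amounts to keeping the first few "sink" entries plus a rolling window of the most recent tokens:

Code:
# Toy illustration of StreamingLLM's cache policy (not the official implementation):
# keep the first `n_sink` tokens plus the most recent `window` tokens, evict everything else.
def evict_kv_cache(cache_positions, n_sink=4, window=2044):
    """cache_positions: list of token positions currently held in the KV cache."""
    if len(cache_positions) <= n_sink + window:
        return cache_positions
    return cache_positions[:n_sink] + cache_positions[-window:]

# Example: no matter how long generation runs, the cache holds at most n_sink + window entries.
positions = list(range(10_000))
kept = evict_kv_cache(positions)
print(len(kept), kept[:6], kept[-2:])   # 2048 [0, 1, 2, 3, 7956, 7957] [9998, 9999]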

QA-LoRA combines quantization with low-rank adaptation (LoRA) to make LLM training and inference computationally cheaper. The key to QA-LoRA is a (group-wise) quantization-aware training scheme that eliminates the need to perform post-training quantization; see below.

“QA-LoRA consistently outperforms QLoRA with PTQ on top of LLMs of different scales (the advantage becomes more significant when the quantization bit width is lower) and is on par with QLoRA without PTQ. Note that during inference, QA-LoRA has exactly the same complexity as QLoRA with PTQ and is much more efficient than QLoRA without PTQ.” - from QA-LoRA paper

Physics of LLMs is a series of papers that study the ability of language models to store/manipulate information. This work finds that language models can only retrieve information that is stored properly during pretraining and struggle to perform complex manipulations of this knowledge (beyond retrieval) without techniques like chain of thought prompting.

GPT-4V is the (long-anticipated) multi-modal extension of GPT-4 that enables the model to process both textual and visual (i.e., images) input from the user. GPT-4V was released within ChatGPT Plus, but it underwent an extensive mitigation and fine-tuning process, which is detailed in the model’s system card, to ensure safety.

LLaVA. GPT-4V is a closed source model, but open-source variants of GPT-4V have already been proposed that can execute dialogue with both textual and visual inputs. LLaVA combines the Vicuna LLM with a vision encoder to create an open-source, multi-modal language model.

 

bnew


Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models​


Paper: https://arxiv.org/abs/2310.04406

Abstract:

While large language models (LLMs) have demonstrated impressive performance on a range of decision-making tasks, they rely on simple acting processes and fall short of broad deployment as autonomous agents. We introduce LATS (Language Agent Tree Search), a general framework that synergizes the capabilities of LLMs in planning, acting, and reasoning. Drawing inspiration from Monte Carlo tree search in model-based reinforcement learning, LATS employs LLMs as agents, value functions, and optimizers, repurposing their latent strengths for enhanced decision-making. What is crucial in this method is the use of an environment for external feedback, which offers a more deliberate and adaptive problem-solving mechanism that moves beyond the limitations of existing techniques. Our experimental evaluation across diverse domains, such as programming, HotPotQA, and WebShop, illustrates the applicability of LATS for both reasoning and acting. In particular, LATS achieves 94.4% for programming on HumanEval with GPT-4 and an average score of 75.9 for web browsing on WebShop with GPT-3.5, demonstrating the effectiveness and generality of our method.

 

bnew


Do yourself a favor - Install Ollama (ollama.ai/), then copy and paste this into your terminal to pull all of the available models down to your local computer:

ollama pull llama2; \
ollama pull llama2-uncensored; \
ollama pull codellama; \
ollama pull codeup; \
ollama pull everythinglm; \
ollama pull falcon; \
ollama pull llama2-chinese; \
ollama pull medllama2; \
ollama pull mistral; \
ollama pull mistral-openorca; \
ollama pull nexusraven; \
ollama pull nous-hermes; \
ollama pull open-orca-platypus2; \
ollama pull orca-mini; \
ollama pull phind-codellama; \
ollama pull samantha-mistral; \
ollama pull sqlcoder; \
ollama pull stable-beluga; \
ollama pull starcoder; \
ollama pull vicuna; \
ollama pull wizard-math; \
ollama pull wizard-vicuna; \
ollama pull wizard-vicuna-uncensored; \
ollama pull wizardcoder; \
ollama pull wizardlm; \
ollama pull wizardlm-uncensored; \
ollama pull zephyr;
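
Once the pulls finish, any of these models can be started from the same terminal, either interactively or with a one-shot prompt (the prompt text here is just an illustration), e.g.:

ollama run mistral "Explain the difference between a LoRA fine-tune and a full fine-tune."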
 

bnew






Med42 - Clinical Large Language Model

Med42 is an open-access clinical large language model (LLM) developed by M42 to expand access to medical knowledge. Built off LLaMA-2 and comprising 70 billion parameters, this generative AI system provides high-quality answers to medical questions.

Model Details​

Note: Use of this model is governed by the M42 Health license. In order to download the model weights (and tokenizer), please read the Med42 License and accept our License by requesting access here.

Beginning with the base LLaMa-2 model, Med42 was instruction-tuned on a dataset of ~250M tokens compiled from different open-access sources, including medical flashcards, exam questions, and open-domain dialogues.

Model Developers: M42 Health AI Team

Finetuned from model: Llama-2 - 70B

Context length: 4k tokens

Input: Text only data

Output: Model generates text only

Status: This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we enhance the model's performance.

License: A custom license is available here

Research Paper: TBA

Intended Use​

Med42 is being made available for further testing and assessment as an AI assistant to enhance clinical decision-making and enhance access to an LLM for healthcare use. Potential use cases include:

  • Medical question answering
  • Patient record summarization
  • Aiding medical diagnosis
  • General health Q&A
To get the expected features and performance of the model, a specific prompt format needs to be followed, including the <|system|>, <|prompter|>, and <|assistant|> tags.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "m42-health/med42-70b"

# Load the model across available GPUs and the matching tokenizer.
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

# Build the prompt using the <|system|>/<|prompter|>/<|assistant|> tags described above.
prompt = "What are the symptoms of diabetes?"
prompt_template = f'''
<|system|>: You are a helpful medical assistant created by M42 Health in the UAE.
<|prompter|>:{prompt}
<|assistant|>:
'''

# Tokenize, generate with sampling, and print the decoded output.
input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True,
                        eos_token_id=tokenizer.eos_token_id,
                        pad_token_id=tokenizer.pad_token_id,
                        max_new_tokens=512)
print(tokenizer.decode(output[0]))

Hardware and Software​

The training process was performed on the Condor Galaxy 1 (CG-1) supercomputer platform.

Evaluation Results​

Med42 achieves competitive performance on various medical benchmarks, including MedQA, MedMCQA, PubMedQA, HeadQA, and Measuring Massive Multitask Language Understanding (MMLU) clinical topics. For all evaluations reported so far, we use EleutherAI's evaluation harness library and report zero-shot accuracies (unless otherwise stated). We compare the performance with that reported for other models (ClinicalCamel-70B, GPT-3.5, GPT-4.0, Med-PaLM 2).

|Dataset|Med42|ClinicalCamel-70B|GPT-3.5|GPT-4.0|Med-PaLM-2 (5-shot)*|
|---|---:|---:|---:|---:|---:|
|MMLU Clinical Knowledge|74.3|69.8|69.8|86.0|88.3|
|MMLU College Biology|84.0|79.2|72.2|95.1|94.4|
|MMLU College Medicine|68.8|67.0|61.3|76.9|80.9|
|MMLU Medical Genetics|86.0|69.0|70.0|91.0|90.0|
|MMLU Professional Medicine|79.8|71.3|70.2|93.0|95.2|
|MMLU Anatomy|67.4|62.2|56.3|80.0|77.8|
|MedMCQA|60.9|47.0|50.1|69.5|71.3|
|MedQA|61.5|53.4|50.8|78.9|79.7|
|USMLE Self-Assessment|71.7|-|49.1|83.8|-|
|USMLE Sample Exam|72.0|54.3|56.9|84.3|-|
*We note that 0-shot performance is not reported for Med-PaLM 2. Further details can be found at GitHub - m42-health/med42.

Key performance metrics:​

  • Med42 achieves a 72% accuracy on the US Medical Licensing Examination (USMLE) sample exam, surpassing the prior state of the art among openly available medical LLMs.
  • 61.5% on MedQA dataset (compared to 50.8% for GPT-3.5)
  • Consistently higher performance on MMLU clinical topics compared to GPT-3.5.

Limitations & Safe Use​

  • Med42 is not ready for real clinical use. Extensive human evaluation is still underway, as it is required to ensure safety.
  • Potential for generating incorrect or harmful information.
  • Risk of perpetuating biases in training data.
Use this model responsibly! Do not rely on it for medical usage without rigorous safety testing.

Accessing Med42 and Reporting Issues​

Please report any software "bug" or other problems through one of the following means:

 

bnew


About​

What LLM to use? A perspective from the Dev+AI space

blog.continue.dev/what-llm-to-use

What LLMs are being used while coding?

How do folks decide?

The first choice you typically make is whether you are going to use an open-source or a commercial model:

  • You usually select an open-source LLM when you want to keep your code within your environment, have enough available memory, want to keep your costs low, or want to be able to manage and optimize everything end-to-end.
  • You usually select a commercial LLM when you want the best model, prefer an easy and reliable setup, don’t have a lot of available memory, don’t mind your code leaving your environment, or are not deterred by cost concerns.
If you decide to use an open-source LLM, your next decision is whether to set up the model on your local machine or on a hosted model provider:

  • You usually opt to use an open-source LLM on your local machine when you have enough available memory, want free usage, or want to be able to use the model without needing an Internet connection.
  • You usually opt to use an open-source LLM on a hosted provider when you want the best open-source model, don’t have a lot of available memory on your local machine, or want the model to serve multiple people.
If you decide to use a commercial LLM, you'll typically obtain API keys and play with multiple of them for comparison. Both the quality of the suggestions and the cost to use can be important criteria.

Open Source

This is a list of the open-source LLMs that developers are using while coding, roughly ordered from most popular to least popular, as of October 2023.

1. Code Llama

Code Llama is an LLM trained by Meta for generating and discussing code. It is built on top of Llama 2. Even though it is below WizardCoder and Phind-CodeLlama on the Big Code Models Leaderboard, it is the base model for both of them. It also comes in a variety of sizes: 7B, 13B, and 34B, which makes it popular to use on local machines as well as with hosted providers. At this point, it is the most well-known open-source base model for coding and is leading the open-source effort to create coding-capable LLMs.

Details

2. WizardCoder

WizardCoder is an LLM built on top of Code Llama by the WizardLM team. The Evol-Instruct method is adapted for coding tasks to create a training dataset, which is used to fine-tune Code Llama. It comes in the same sizes as Code Llama: 7B, 13B, and 34B. As a result, it is the most popular open-source instruction-tuned LLM so far.

Details

3. Phind-CodeLlama

Phind-CodeLlama is an LLM built on top of Code Llama by Phind. A proprietary dataset of ~80k high-quality programming problems and solutions was used to fine-tune Code Llama. That fine-tuned model was then further fine-tuned on 1.5B additional tokens. It currently leads on the Big Code Models Leaderboard. However, it is only available as a 34B parameter model, so it requires more available memory to be used.

Details

4. Mistral

Mistral is a 7B parameter LLM trained by Mistral AI. It is the most recently released model on this list, having dropped at the end of September. Mistral AI says that it “approaches CodeLlama 7B performance on code, while remaining good at English tasks”. Despite only being available in one small size, people have been quite excited about it in the first couple of weeks since release. The first fine-tuned LLMs that use it as their base are now beginning to emerge, and we are likely to see more going forward.

Details

5. StarCoder

StarCoder is a 15B parameter LLM trained by BigCode, which was ahead of its time when it was released in May. It was trained on 80+ programming languages from The Stack (v1.2) with opt-out requests excluded. It is not an instruction model and commands like "Write a function that computes the square root" do not work well. However, by using the Tech Assistant prompt you can make it more helpful.

Details

6. Llama2

Llama 2 is an LLM trained by Meta on 2 trillion tokens. It is the most popular open source LLM overall, so some developers use it, despite it not being as good as many of the models above at making code edits. It is also important because Code Llama, the most popular LLM for coding, is built on top of it, which in turn is the foundation for WizardCoder and Phind-CodeLlama.

Details

Commercial

This is a list of the commercial LLMs that developers are using while coding, roughly ordered from most popular to least popular, as of October 2023.

1. GPT-4

GPT-4 from OpenAI is generally considered to be the best LLM to use while coding. It is quite helpful when generating and discussing code. However, it requires you to send your code to OpenAI via their API and can be quite expensive. Nevertheless, it is the most popular LLM for coding overall and the majority of developers use it while coding at this point. All OpenAI API users who made a successful payment of $1 or more before July 6, 2023 were given access to GPT-4, and they plan to open up access to all developers soon.

2. GPT-3.5 Turbo

GPT-3.5 Turbo from OpenAI is cheaper and faster than GPT-4; however, its suggestions are not nearly as helpful. It also requires you to send your code to OpenAI via their API. It is the second most popular LLM for coding overall so far. All developers can use it now after signing up for an OpenAI account.

3. Claude 2

Claude 2 is an LLM trained by Anthropic, which has greatly improved coding skills compared to the first version of Claude. It especially excels, relative to other LLMs, when you provide a lot of context. It requires you to send your code to Anthropic via their API. You must apply to get access to Claude 2 at this point.

4. PaLM 2

PaLM 2 is an LLM trained by Google. To try it out, you must send your code to Google via the PaLM API after obtaining an API key via MakerSuite, both of which are currently in public preview.
 