bnew

Veteran
Joined
Nov 1, 2015
Messages
57,910
Reputation
8,572
Daps
161,444


Extremely impressed with what @dani_avila7 is building with CodeGPT (codegpt.co/)!! 🤯

CodeGPT pairs Code Documentation RAG with a GUI, so users can pick which docs they want to chat with: LangChain, LlamaIndex, Weaviate, ... This interface is integrated into VSCode and is also available as a standard chatbot interface.

Daniel and I met to chat about our shared love of the Gorilla 🦍 project from UC Berkeley (Daniel has created an awesome HuggingFace spaces demo to test the original model from Patil et al. - huggingface.co/spaces/davila…),🤗.

We had a really interesting debate on Zero-Shot LLMs + RAG versus Retrieval-Aware Fine-tuning the LLaMA 7B model Gorilla style.

It seems like there is a massive opportunity to help developers serve their LLaMA 7B models. The Gorilla paper shows that this achieves better performance (attached) -- and although it should in principle be cheaper to serve the 7B LLM, the DIY inference tooling is behind the commercial APIs.

Gorilla is also a fantastic opportunity to explore how these advanced query engines can better pack the prompt with information.

I am very curious how code assistant agents can benefit from non-formal documentation as well, such as blog posts, StackOverflow Q&A, and maybe even podcasts.

I also think there is a huge opportunity to explore how these advanced query engines can better pack the prompt with information. For example, the Weaviate Gorilla is simply translating from natural language commands to GraphQL APIs, but as we imagine integrations deeper into codebases, we may want to jointly retrieve from the user's codebase as well as the API docs to pack the prompt.
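To make the "jointly retrieve and pack the prompt" idea concrete, here is a minimal sketch. It is not how CodeGPT or the Weaviate Gorilla work: the `Snippet` type, the source labels, and the word-overlap score are all assumptions made for illustration, and a real system would use a proper retriever.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    source: str  # "codebase" or "api_docs" (illustrative labels, not a real schema)
    text: str


def overlap_score(query: str, text: str) -> int:
    """Toy relevance score: number of distinct query words found in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words)


def pack_prompt(query: str, snippets: list[Snippet], per_source: int = 1) -> str:
    """Pick the top snippets from each source and join them into one prompt."""
    sections = []
    for source in ("codebase", "api_docs"):
        candidates = [s for s in snippets if s.source == source]
        ranked = sorted(
            candidates,
            key=lambda s: overlap_score(query, s.text),
            reverse=True,
        )
        for snippet in ranked[:per_source]:
            sections.append(f"[{source}]\n{snippet.text}")
    context = "\n\n".join(sections)
    return f"{context}\n\nQuestion: {query}"
```

Retrieving per source, rather than from one merged pool, keeps one source from crowding the other out of the prompt.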

Thank you so much for the chat Daniel! Love learning more about how people are thinking about Gorilla LLMs and LLMs for Code! 🦍😎
 

bnew


Mistral-7B-OpenOrca










DEMO:
 

bnew


https://web.archive.org/web/20231003184505/https://old.reddit.com/r/LocalLLaMA/comments/16y95hk/a_starter_guide_for_playing_with_your_own_local_ai/
 

bnew













Train the best open-source standard models

At the end of 2023, we will train a family of text-generating models that can beat ChatGPT 3.5 and Bard March 2023 by a large margin, as well as all open-source solutions.

Part of this family will be open-sourced; we will engage the community to build on top of it and make it the open standard.

We will service these models with the same endpoints as our competitor for a fee to acquire third-party usage data, and create a few free consumer interfaces for trademark construction and first-party usage data.

edit:

 

bnew



🎯AWESOME FREE AI TOOL

- Introducing Ollama LLM - Free for all to use.

- Watch my little video 📢

1. Go to: ollama.ai
2. Download
3. Write: ollama run llama2 (in the terminal)
4. Ask it a question just like ChatGPT
5. Try writing: ollama run mistral (another free AI Model)
6. More info: github.com/jmorganca/ollama

- Alternative to ChatGPT / Claude / Bard / Bing AI

- I have been playing with the Ollama LLM for the last week, and I must say it is an awesome tool.

- 2 minutes to install and then you have your own Free AI LLM running locally on your Mac, Linux, or Windows machine.

- Run Llama 2, Code Llama, and other models. Customize and create your own.

- It does not require internet and can be run without a heavy PC / Laptop.

- In this example it answers questions about:

- Brazil
- Albert Einstein
- Make a funny tweet about a cat and a dog
- Code a basic Form.

Enjoy and let me know what you think🎁
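Once `ollama run llama2` works in the terminal, the same model can also be reached from a script through Ollama's local HTTP API. The sketch below uses only the Python standard library and assumes the server is running on its default port (11434) with `llama2` already pulled; the helper names `build_request` and `ask` are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes the default local port


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(model, prompt)) as response:
        return json.loads(response.read())["response"]


# Example call (needs a running Ollama server, so it is left commented out):
# print(ask("llama2", "Make a funny tweet about a cat and a dog"))
```

Swapping `"llama2"` for `"mistral"` should behave like step 5 above, provided that model has been downloaded.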

 




 

bnew


CollectiveCognition, a unique AI-powered platform for effortlessly sharing, tagging, and voting on ChatGPT-generated conversations.


 

bnew


LLMs can be extended to infinite sequence lengths without fine-tuning​


Mike Young

Oct 2, 2023 (4 min read)
LLMs trained with a finite attention window can be extended to infinite sequence lengths without any fine-tuning.


StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4M tokens.

In recent years, natural language processing has been revolutionized by the advent of large language models (LLMs). Massive neural networks like GPT-3, PaLM, and BlenderBot have demonstrated remarkable proficiency at various language tasks like conversational AI, summarization, and question-answering. However, a major impediment restricts their practical deployment in real-world streaming applications.

LLMs are pre-trained on texts of finite lengths, usually a few thousand tokens. As a result, their performance deteriorates rapidly when presented with sequences longer than those seen during training. This limitation renders LLMs incapable of reliably handling long conversations as required in chatbots and other interactive systems. Additionally, their inference process caches all previous tokens' key-value states, consuming extensive memory.

Researchers from MIT, Meta AI, and Carnegie Mellon recently proposed StreamingLLM, an efficient framework to enable infinite-length language modeling in LLMs without expensive fine-tuning. Their method cleverly exploits the LLMs' tendency to use initial tokens as "attention sinks" to anchor the distribution of attention scores. By caching initial tokens alongside recent ones, StreamingLLM restores perplexity and achieves up to 22.2x faster decoding than the sliding-window recomputation baseline.

The paper they published says it clearly:

We introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence length without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more.
This blog post explains the key technical findings of this work and their significance in plain terms. The ability to deploy LLMs for unlimited streaming inputs could expand their applicability across areas like assistive AI, tutoring systems, and long-form document generation. However, cautions remain around transparency, bias mitigation, and responsible use of these increasingly powerful models.


The Challenges of Deploying LLMs for Streaming​

Unlike humans who can sustain conversations for hours, LLMs falter beyond short contexts. Two primary issues encumber their streaming application:

Memory Overhead: LLMs based on Transformer architectures cache the key-value states of all previous tokens during inference. This memory footprint balloons with sequence length, eventually exhausting GPU memory.

Performance Decline: More critically, LLMs lose their abilities when context lengths exceed those seen during pre-training. For example, a model trained on 4,000-token texts degrades sharply on longer sequences.
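To put a number on the memory overhead point, here is a back-of-the-envelope calculation of KV cache size. The default shape (32 layers, 32 key-value heads, head dimension 128, 2 bytes per value for fp16) is an assumption roughly matching a 7B Llama-style model; none of these figures come from the article.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence of seq_len tokens."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


for tokens in (4_096, 65_536, 4_000_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>9,} tokens -> {gib:,.1f} GiB")
```

Under these assumptions every cached token costs 512 KiB, so the cache grows linearly with sequence length and a full cache for millions of tokens is far beyond a single GPU.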

Real-world services like chatbots, tutoring systems, and voice assistants often need to maintain prolonged interactions. But LLMs' limited context capacity hampers their deployment in such streaming settings. Prior research attempted to expand the training corpus length or optimize memory usage, but fundamental barriers remained.

Windowed Attention and Its Limitations​

An intuitive technique called windowed attention emerged to mitigate LLMs' memory overhead. Here, only the key-values of the most recent tokens within a fixed cache size are retained. Once this rolling cache becomes full, the earliest states are evicted. This ensures constant memory usage and inference time.
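The rolling cache described above can be sketched with a fixed-length deque. This is a toy model of the eviction policy only: the class name `WindowKVCache` is invented here, and placeholder strings stand in for the real key-value tensors.

```python
from collections import deque


class WindowKVCache:
    """Rolling cache that keeps only the most recent `window` entries."""

    def __init__(self, window: int):
        # A deque with maxlen drops the oldest entry automatically when full.
        self.entries = deque(maxlen=window)

    def append(self, position: int, kv) -> None:
        self.entries.append((position, kv))

    def positions(self) -> list[int]:
        return [pos for pos, _ in self.entries]


cache = WindowKVCache(window=4)
for pos in range(6):
    cache.append(pos, kv=f"kv{pos}")  # placeholder strings stand in for tensors
print(cache.positions())  # [2, 3, 4, 5]
```

Note that positions 0 and 1, the very first tokens, are the first to be evicted once the window fills up.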

However, an annoying phenomenon occurs - the model's predictive performance drastically deteriorates soon after the starting tokens fade from the cache. But why should removing seemingly unimportant old tokens impact future predictions so severely?

The Curious Case of Attention Sinks​

Analyzing this conundrum revealed the LLM's excessive attention towards initial tokens, making them act as "attention sinks." Even if semantically irrelevant, they attract high attention scores across layers and heads.

The reason lies in the softmax normalization of attention distributions. Softmax forces the attention weights over all visible tokens to sum to one, so a query must place its attention somewhere even when no token in the context is relevant. The LLM dumps this unnecessary attention into specific tokens - preferentially the initial ones, which are visible to all subsequent positions.
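A small numeric illustration of that softmax behaviour follows. The scores are made up for the example (index 0 plays the role of a sink token); the point is only that the weights always sum to one, so removing the sink pushes its share onto the remaining tokens.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for one query: index 0 is the sink, the rest are context tokens.
with_sink = softmax([4.0, 0.5, 0.2, 0.1])
without_sink = softmax([0.5, 0.2, 0.1])  # the same context after the sink is evicted

print(f"sink share: {with_sink[0]:.2f}")
print(f"first context token, before vs after eviction: "
      f"{with_sink[1]:.2f} vs {without_sink[0]:.2f}")
```

With the sink gone, the context tokens receive much larger weights than the model ever saw for them during training, which is one way to picture the distortion described next.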

Critically, evicting the key-values of these attention sinks warped the softmax attention distribution. This destabilized the LLM's predictions, explaining windowed attention's failure.

StreamingLLM: Caching Initial Sinks and Recent Tokens​

Leveraging this insight, the researchers devised StreamingLLM - a straightforward technique to enable infinite-length modeling in already trained LLMs, without any fine-tuning.

The key innovation is maintaining a small cache containing initial "sink" tokens alongside only the most recent tokens. Specifically, adding just 4 initial tokens proved sufficient to recover the distribution of attention scores back to normal levels. StreamingLLM combines this compact set of anchored sinks with a rolling buffer of recent key-values relevant for predictions.
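As a rough sketch of that cache policy, the function below returns which token positions would stay cached: the first few sink tokens plus a rolling window of recent ones. The function name is invented here, and the window of 8 recent tokens is an arbitrary choice for readability; only the 4 sink tokens come from the article.

```python
def streaming_cache_positions(seq_len: int, n_sink: int = 4, n_recent: int = 8) -> list[int]:
    """Return the original token positions whose key-values stay cached."""
    if seq_len <= n_sink + n_recent:
        return list(range(seq_len))  # cache not full yet, nothing is evicted
    sinks = list(range(n_sink))  # the initial "attention sink" tokens
    recent = list(range(seq_len - n_recent, seq_len))  # rolling window of recent tokens
    return sinks + recent


kept = streaming_cache_positions(seq_len=20)
print(kept)  # [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```

The cache never holds more than `n_sink + n_recent` entries, so memory stays constant no matter how long the stream runs.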

(There are some interesting parallels to a similar paper in ViT research around registers, which you can read here.)

This simple restoration allowed various LLMs like Llama-2, MPT, Falcon, and Pythia to smoothly handle texts of up to 4 million tokens - roughly 1000x a 4,000-token training window! Dumping unnecessary attention into the dedicated sinks prevented distortion, while recent tokens provided relevant semantics.

StreamingLLM achieved up to a 22.2x speedup over the sliding-window recomputation baseline while retaining comparable perplexity. So by removing this key bottleneck, we may be able to enable practical streaming deployment of LLMs in interactive AI systems.

Pre-training with a Single Sink Token​

Further analysis revealed that LLMs learned to split attention across multiple initial tokens because their training data lacked a consistent starting element. The researchers proposed prepending a special "Sink Token" to all examples during pre-training, so that it is visible to every later position.

Models trained this way coalesced attention into this single dedicated sink. At inference time, providing just this token alongside recent ones sufficiently stabilized predictions - no other initial elements were needed. This method could further optimize future LLM designs for streaming usage.

Conclusion​

By identifying initial tokens as attention sinks, StreamingLLM finally enables large language models to fulfill their potential in real-world streaming applications. Chatbots, virtual assistants, and other systems can now leverage LLMs to smoothly sustain long conversations.

However, while this removes a technical barrier, concerns around bias, transparency, and responsible AI remain when deploying such powerful models interacting with humans - infinite context window or not. But used judiciously under the right frameworks, the StreamingLLM approach could open up new beneficial applications of large language models.




EFFICIENT STREAMING LANGUAGE MODELS WITH ATTENTION SINKS

ABSTRACT

Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens’ Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach - but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a “sink” even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence length without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2× speedup. Code and datasets are provided in the link.





 

bnew


[Submitted on 3 Oct 2023]

Think before you speak: Training Language Models With Pause Tokens​

Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, Vaishnavh Nagarajan
Language models generate responses by producing a series of tokens in immediate succession: the (K+1)th token is an outcome of manipulating K hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate say, K+10 hidden vectors, before it outputs the (K+1)th token? We operationalize this idea by performing training and inference on language models with a (learnable) pause token, a sequence of which is appended to the input prefix. We then delay extracting the model's outputs until the last pause token is seen, thereby allowing the model to process extra computation before committing to an answer. We empirically evaluate pause-training on decoder-only models of 1B and 130M parameters with causal pretraining on C4, and on downstream tasks covering reasoning, question-answering, general understanding and fact recall. Our main finding is that inference-time delays show gains when the model is both pre-trained and finetuned with delays. For the 1B model, we witness gains on 8 of 9 tasks, most prominently, a gain of 18% EM score on the QA task of SQuAD, 8% on CommonSenseQA and 1% accuracy on the reasoning task of GSM8k. Our work raises a range of conceptual and practical future research questions on making delayed next-token prediction a widely applicable new paradigm.
Comments: 19 pages, 7 figures
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2310.02226 [cs.CL]
(or arXiv:2310.02226v1 [cs.CL] for this version)
[2310.02226] Think before you speak: Training Language Models With Pause Tokens

Submission history​

From: Sachin Goyal
[v1] Tue, 3 Oct 2023 17:32:41 UTC (610 KB)
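For readers who want the pause-token idea from the abstract in concrete terms, here is a minimal sketch of the input and output bookkeeping only. It is not the authors' code: the `PAUSE_ID` value and both helper names are hypothetical, and the model itself is left out.

```python
PAUSE_ID = -1  # hypothetical id standing in for the learnable <pause> token


def add_pause_tokens(prefix_ids: list[int], n_pause: int = 10) -> list[int]:
    """Append n_pause copies of the pause token to the input prefix."""
    return prefix_ids + [PAUSE_ID] * n_pause


def outputs_after_pauses(per_position_outputs: list, prefix_len: int, n_pause: int) -> list:
    """Ignore outputs produced before the last pause token has been seen."""
    last_pause_index = prefix_len + n_pause - 1
    # The output at the last pause position is the first one that gets used.
    return per_position_outputs[last_pause_index:]


ids = add_pause_tokens([5, 6, 7], n_pause=2)
print(ids)  # [5, 6, 7, -1, -1]
```

With a prefix of K tokens and 10 pause tokens, the model gets K+10 hidden vectors per layer to work with before its first answer token is read out, which matches the setup described in the abstract.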

 