bnew


Bringing Open Large Language Models to Consumer Devices

May 22, 2023 • MLC Community

The rapid proliferation of open-source Large Language Models (LLMs) has sparked a strong desire among diverse user groups to independently utilize their own models within local environments. This desire stems from the constant introduction of new LLM innovations, offering improved performance and a range of customizable options. Researchers, developers, companies and enthusiasts all seek the flexibility to deploy and fine-tune LLMs according to their specific needs. By running models locally, they can tap into the diverse capabilities of LLM architectures and effectively address various language processing tasks.

As the landscape of LLMs grows increasingly diverse, models have appeared under a variety of license constraints. Driven by the desire to expand the range of available options and broaden the use cases of LLMs, the latest wave of work has focused on introducing more permissive, truly open LLMs that cater to both research and commercial interests; noteworthy examples include RedPajama, FastChat-T5, and Dolly.

Having closely observed these recent advancements, we are thrilled not only by the remarkable capabilities exhibited by these parameter-efficient models, ranging from 3 billion to 7 billion parameters, but also by the exciting opportunity for end users to explore and leverage the power of personalized open LLMs, fine-tuned at reasonable cost, making generative AI accessible to everyone for a wide range of applications.

MLC LLM aims to make open LLMs accessible by making them possible and convenient to deploy on browsers, mobile devices, consumer-class GPUs, and other platforms. It brings universal deployment of LLMs to AMD, NVIDIA, and Intel GPUs, Apple Silicon, iPhones, and Android phones.

This post describes our effort to streamline the deployment of open LLMs through a versatile machine learning compilation infrastructure. We bring RedPajama, a permissively licensed open language model, to WebGPU, iOS, GPUs, and various other platforms. Furthermore, the workflow we have established can be easily adapted to support a wide range of models with fine-tuned (personalized) weights, promoting flexibility and customization in LLM deployment.

Universal Deployment of RedPajama

RedPajama models exemplify how the open-source community can rapidly construct high-performing LLMs. RedPajama-3B is a small yet powerful model that downstream users can fine-tune to their specific needs, empowering individuals from diverse backgrounds to run open LLMs with easy personalization. We share this vision of accessibility and personalization to fully realize the potential of LLM technology within the broader community. As a result, we bring RedPajama support to a wide range of consumer devices with hardware acceleration.

RedPajama on Apple Silicon is achieved by compiling the LLM with Metal for M1/M2 GPUs (try out). Furthermore, MLC LLM provides a C API wrapper, libmlc_llm.dylib, that enables interaction with the generated Metal library. As an illustrative example, the command-line tool mlc_chat_cli showcases the usage of libmlc_llm.dylib while also giving users an interface to chat with RedPajama.
[Figure: cli.gif]
Similarly, RedPajama on consumer-class AMD/NVIDIA GPUs (try out) leverages TVM Unity’s Vulkan backend. The compilation process produces a corresponding wrapper library, libmlc_llm.so, that encapsulates the generated SPIR-V/Vulkan code, and users may use mlc_chat_cli to chat with RedPajama. TVM Unity also has CUDA and ROCm backends, and users can build alternative CUDA solutions themselves following the same workflow.
[Figure: web.gif]
Leveraging WebAssembly and WebGPU, MLC LLM extends RedPajama smoothly to web browsers (try out). TVM Unity compiles the LLM operators to WebGPU, and together with a lightweight WebAssembly runtime and a thin JavaScript driver, llm_chat.js, RedPajama can be deployed as a static web page, harnessing clients’ own GPUs for local inference without server support.
[Figure: ios.gif]
RedPajama on iOS follows a similar approach to Apple Silicon, utilizing Metal as the code generation backend (try out). However, due to iOS restrictions, static libraries (e.g., libmlc_llm.a) are produced instead. To demonstrate interaction with libmlc_llm.a, we provide an Objective-C++ file, LLMChat.mm, as a practical example, as well as a simple SwiftUI app that runs the LLM end to end.

How

Machine Learning Compilation (MLC) from TVM Unity plays a critical role in enabling efficient deployment and democratization of Open LLMs. With TVM Unity, several key features contribute to its effectiveness and accessibility:
  • Comprehensive code generation: TVM Unity supports code generation for a wide range of common CPU and GPU backends, including CUDA, ROCm, Vulkan, Metal, OpenCL, WebGPU, x86, ARM, etc. This expansive coverage allows for LLM deployment across diverse consumer environments, ensuring compatibility and performance.
  • Python-first development: MLC LLM compilation is developed in pure Python, thanks to the Python interface provided by TVM Unity, empowering developers to swiftly write optimization techniques and compilation passes and to compose LLM building blocks. This facilitates rapid development and experimentation, allowing us to quickly bring in new model and backend support.
  • Built-in optimizations: TVM Unity incorporates a suite of built-in optimizations, such as operator fusion and loop tiling, which are keystones of high-quality code generation across multiple hardware platforms. MLC LLM builds on these optimizations, and ML engineers can reuse them to amplify their own workflows.
  • First-class support for vendor libraries and handcrafted kernels: TVM Unity treats handcrafted kernels, such as NVIDIA’s CUTLASS and cuBLAS libraries, as first-class citizens. This ensures seamless integration of the best-performing code, allowing developers to leverage specialized and optimized implementations when necessary.
  • Universal runtime: a runtime that brings deployment to the programming language and platform of the developers’ choice.
[Figure: compilation-workflow.svg]
MLC LLM follows a streamlined compilation process:
  • LLM architecture definition: Users can choose from several built-in models, such as RedPajama, Vicuna, Llama, Dolly, or define their own models using a PyTorch-like syntax provided by TVM Unity.
  • ML compilation: MLC LLM uses TVM Unity’s quantization and optimization passes to compile high-level operators into GPU-friendly kernels that are natively compiled for consumer hardware (see the quantization sketch after this list).
  • Universal deployment: along with the compiled artifacts from the previous step, MLC LLM provides a convenient pack of the tokenizer and a lightweight runtime for easy deployment on all major platforms, including browsers, iOS, Android, Windows, macOS, and Linux.
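
To make the ML compilation step above more concrete, here is a minimal sketch of group-wise 4-bit weight quantization, the general idea behind quantization modes such as q4f16. It illustrates the technique only, not MLC LLM's or TVM Unity's actual implementation; the group size and the NumPy code are assumptions.

```python
# Minimal sketch of group-wise symmetric 4-bit weight quantization.
# Illustrative only -- not the actual MLC LLM / TVM Unity implementation.
import numpy as np

def quantize_q4(weight: np.ndarray, group_size: int = 32):
    """Quantize a flat weight vector to signed 4-bit integers, one scale per group."""
    w = weight.reshape(-1, group_size)                                    # split weights into groups
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-8) / 7.0  # map each group onto [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)                                    # 4-bit codes + fp16 scales

def dequantize_q4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_q4(w)
print("max abs error:", np.abs(w - dequantize_q4(q, s)).max())
```

In the real pipeline, TVM Unity generates and fuses kernels that operate directly on packed, quantized weights rather than dequantizing whole tensors as this sketch does.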

Empowering Personalized Fine-Tuned Models

Demand for personalizing LLMs, particularly RedPajama and Vicuna/Llama, is strong, and fine-tuned LLMs have come to dominate the open-source community, so empowering personalized models is a key feature. MLC LLM allows convenient weight customization: the user only needs to provide a directory in Huggingface format, and it will produce the proper model artifacts through exactly the same process.
[Figure: customization.svg]
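
As a rough illustration of what "a directory in Huggingface format" means in practice, the sketch below checks that a fine-tuned checkpoint directory has the usual layout and loads its config and tokenizer with the transformers library before handing it to the same compilation flow. The directory path is hypothetical, and this is not part of the MLC LLM API.

```python
# Sanity-check a fine-tuned checkpoint in Hugging Face format before compiling it.
# The path is hypothetical; this sketch is not part of MLC LLM itself.
from pathlib import Path
from transformers import AutoConfig, AutoTokenizer

model_dir = Path("./my-finetuned-redpajama-3b")  # hypothetical local directory

# A Hugging Face-format checkpoint typically contains config.json, tokenizer
# files, and one or more *.bin or *.safetensors weight shards.
shards = list(model_dir.glob("*.bin")) + list(model_dir.glob("*.safetensors"))
assert (model_dir / "config.json").exists(), "missing config.json"
assert shards, "missing weight shards (*.bin / *.safetensors)"

config = AutoConfig.from_pretrained(model_dir)        # architecture + hyperparameters
tokenizer = AutoTokenizer.from_pretrained(model_dir)  # tokenizer shipped with the weights
print(config.model_type, "vocab size:", tokenizer.vocab_size)
```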
MLC LLM’s chat applications (CLI, iOS, Web, Android) are specifically designed to seamlessly integrate personalized models. Developers can easily share a link to the model artifacts they have generated, enabling the chat apps to incorporate the personalized model weights.
[Figure: ios-model-selector.jpeg]
The iOS app allows users to download personalized weights for the same model on demand via a link to the model artifacts, without recompilation or redeployment. This streamlined approach makes it convenient to share model weight variants. The same model artifacts can be consumed by the other runtimes, such as the web app, the CLI, and Android (upcoming).

Please refer to our project page for a detailed guide on how to try out the MLC LLM deployment. The source code of MLC LLM is available on our official GitHub repository. You are also more than welcome to join the Discord channel for further discussion.

Ongoing Effort

MLC LLM is a fairly young project and there is a lot to be done. As we streamline the overall project architecture and modularize the overall flow, we would love to focus on empowering developer communities. Our first priority is to bring documentation for our developers so they can build on top of our effort. We are actively working on documenting the compilation of models with customized weights. Additionally, we are modularizing the libraries so they can be reused in other applications across web, Windows, macOS, Linux, iOS, and Android platforms. We are also expanding the prebuilt MLC pip development package on Windows, Linux, and macOS to simplify the experience for developers. At the same time, we are continuously working with the community to bring in more model architectures, and we will bring more optimizations to continuously improve the memory footprint and performance of the overall system.
 

Fresh

I just watched the movie Ex Machina again. :francis:


basically every AI movie that's come out has been about it backfiring on humans, so why are humans still pursuing all these goals of artificial intelligence? I'm not talking about taking movies LITERALLY or anything, just the basic premise
 

bnew

I just watched the movie Ex Machina again. :francis:


basically every AI movie that's come out has been about it backfiring on humans, so why are humans still pursuing all these goals of artificial intelligence? I'm not talking about taking movies LITERALLY or anything, just the basic premise

got to :manny:
 

bnew




About

Determine the tokens that optimally represent a dataset at any specific vocabulary size

Given a text dataset, a vocabulary size, and a maximum token length, tokenmonster selects the tokens that optimally represent your dataset at that vocabulary size by brute force. It can do this at reasonable speed (within 24 hours) on server hardware, at a cost of around $8. Prebuilt vocabularies are provided, as well as tools & libraries for tokenization and detokenization using the prebuilt or your own vocabularies.

Test tokenmonster in your browser here.

tokenmonster is a novel approach to tokenization with broad-ranging use potential, but its primary motivation is to increase the inference speed and context length of large language models by choosing better tokens. By selecting more optimal tokens, text can be represented with 20-30% fewer tokens compared to other modern tokenizing methods, increasing the speed of inference and training, and the length of text that fits in context, by 20-30%. The code-optimized tokenizers do even better; see it for yourself.

I also believe that tokenmonster vocabularies will improve the comprehension of Large Language Models. For more details see How and Why.

Features

  • Longer text generation at faster speed
  • Determines the optimal token combination for a greedy tokenizer (non-greedy support coming)
  • Successfully identifies common phrases and figures of speech
  • Works with all languages and formats, even binary
  • Works well with HTML tags, sequential spaces, tabs, etc. without wasting context
  • Does not require normalization or preprocessing of text
  • Averages > 5 characters per token
  • No GPU needed

Greedy vs. Non-greedy

The current algorithm is a greedy algorithm (as are all other popular tokenization methods, as far as I know). I have an idea for a non-greedy method that would add only about 10% overhead to the tokenization process. I will test it, and if it's notably more efficient, I will replace the current greedy tokenizers with the non-greedy versions. All of that will be completed by the end of May.
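
For readers unfamiliar with the distinction, the sketch below is a plain greedy (longest-match-first) tokenizer over a toy vocabulary. It illustrates the general approach only, not tokenmonster's implementation; the vocabulary and the single-character fallback are assumptions.

```python
# Toy greedy (longest-match-first) tokenizer. Illustrates the general approach,
# not tokenmonster's code; the vocabulary and character fallback are assumptions.

def greedy_tokenize(text: str, vocab: set[str], max_token_len: int) -> list[str]:
    tokens, i = [], 0
    while i < len(text):
        # Commit to the longest matching token immediately, even if a shorter
        # choice here would allow a better split later (that is what makes it greedy).
        for length in range(min(max_token_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in vocab or length == 1:  # single characters act as a fallback
                tokens.append(candidate)
                i += length
                break
    return tokens

vocab = {"the ", "cat ", "sat ", "on the ", "on ", "mat"}
print(greedy_tokenize("the cat sat on the mat", vocab, max_token_len=8))
# ['the ', 'cat ', 'sat ', 'on the ', 'mat']
```

A non-greedy tokenizer would instead look ahead and occasionally prefer a shorter token now if that leads to fewer tokens overall.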

Prebuilt Vocabularies

The following vocabularies have already been built:

Name            | Vocab Size | Dataset Size | Dataset Source
general-english | 65535      | 840 MB       | arxiv + book + c4 + cc + github + stackexchange + wikipedia + reddit
general-english | 32000      | 840 MB       | arxiv + book + c4 + cc + github + stackexchange + wikipedia + reddit
code            | 65535      | 263 MB       | github + stackexchange
code            | 32000      | 263 MB       | github + stackexchange
The training data mostly came from the Red Pajamas 1B Token Sample. However, to reduce formal English and emphasize other languages, informal writing, and code, I made the following modifications to the Red Pajamas sample: book_sample, c4_sample & cc_sample were cropped to 100 MB, and Reddit conversations data were added (also cropped to 100 MB).

Filename                 | Filesize (bytes)
arxiv_sample.txt         | 88,925,569
book_sample.txt          | 108,069,616
c4_sample.txt            | 100,560,318
cc_2023-06_sample.txt    | 100,852,231
github_sample.txt        | 191,123,094
stackexchange_sample.txt | 71,940,138
wikipedia_sample.txt     | 79,181,873
reddit.txt               | 100,027,565
 

bnew






About

QLoRA: Efficient Finetuning of Quantized LLMs

arxiv.org/abs/2305.14314

QLoRA: Efficient Finetuning of Quantized LLMs

| Paper | Adapter Weights | Demo |

This repo supports the paper "QLoRA: Efficient Finetuning of Quantized LLMs", an effort to democratize access to LLM research.

QLoRA uses bitsandbytes for quantization and is integrated with Huggingface's PEFT and transformers libraries. QLoRA was developed by members of the University of Washington's UW NLP group.

Overview

We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights (b) Double Quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) Paged Optimizers to manage memory spikes. We use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. We release all of our models and code, including CUDA kernels for 4-bit training.
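
As a concrete illustration of how the components named above (a frozen 4-bit NF4 base model, double quantization, and LoRA adapters) fit together via bitsandbytes, PEFT, and transformers, here is a minimal setup sketch. The model id, hyperparameters, and target modules are illustrative assumptions, not the paper's exact recipe; see the repository's scripts for that.

```python
# Minimal QLoRA-style setup sketch using transformers + peft + bitsandbytes.
# Model id, hyperparameters, and target modules are illustrative, not the
# paper's exact configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "huggyllama/llama-7b"  # illustrative base model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # keep the frozen base model in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05, bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative choice
)
model = get_peft_model(model, lora_config)  # gradients flow only into the adapters
model.print_trainable_parameters()

# Training then proceeds with a standard Trainer; paged optimizers are exposed
# through e.g. TrainingArguments(optim="paged_adamw_32bit").
```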

License and Intended Use

We release the resources associated with QLoRA finetuning in this repository under MIT license. In addition, we release the Guanaco model family for base LLaMA model sizes of 7B, 13B, 33B, and 65B. These models are intended for purposes in line with the LLaMA license and require access to the LLaMA models.

Demo

Guanaco is a system purely intended for research purposes and could produce problematic outputs.

  1. Access the live demo here.
  2. Or host your own Guanaco gradio demo directly in Colab with this notebook. Works with free GPUs for 7B and 13B models.
  3. Alternatively, can you distinguish ChatGPT from Guanaco? Give it a try! You can access the model response Colab here comparing ChatGPT and Guanaco 65B on Vicuna prompts.
 

bnew




Meta AI Unleashes Megabyte, a Revolutionary Scalable Model Architecture


Meta's research team unveils an innovative AI model architecture, capable of generating more than 1 million tokens across multiple formats and exceeding the capabilities of the existing Transformer architecture behind models like GPT-4.


Meta's new proposed AI architecture could replace the popular Transformer models driving today's language models. Photo illustration: Artisana
🧠 Stay Ahead of the Curve
  • Meta AI researchers have proposed a groundbreaking architecture for AI decoder models, named the Megabyte model, capable of producing extensive content.
  • The Megabyte model addresses scalability issues in current models and performs calculations in parallel, boosting efficiency, and outperforming Transformers.
  • This innovation could instigate a new era in AI development, transcending the Transformer architecture and unlocking unprecedented capabilities in content generation.


By Michael Zhang

May 23, 2023


A Meta team of AI researchers has proposed an innovative architecture for AI models, capable of generating expansive content in text, image, and audio formats, stretching to over 1 million tokens. This groundbreaking proposal, if embraced, could pave the way for the next generation of proficient AI models, transcending the Transformer architecture that underpins models such as GPT-4 and Bard, and unleashing novel capacities in content generation.

The Constraints of Current Models


Contemporary high-performing generative AI models, like OpenAI's GPT-4, are grounded in the Transformer architecture. Initially introduced by Google researchers in 2017, this architecture forms the backbone of emergent AI models, facilitating an understanding of nuanced inputs and generating extensive sentences and documents.

Nonetheless, Meta's AI research team posits that the prevailing Transformer architecture might be reaching its threshold. They highlight two significant flaws inherent in the design:
  1. With the increase in the length of inputs and outputs, self-attention scales quadratically. Because each word processed or produced by a Transformer language model must attend to every other word, the computation becomes highly intensive at thousands of words, whereas it is far less problematic for smaller word counts.
  2. Feedforward networks, which aid language models in comprehending and processing words through a sequence of mathematical operations and transformations, struggle with scalability on a per-position basis. These networks operate on character groups or "positions" independently, leading to substantial computational expenses.

Megabyte Model: The Game Changer


The Megabyte model, introduced by Meta AI, showcases a uniquely different architecture, dividing a sequence of inputs and outputs into "patches" rather than individual tokens. Within each patch, a local AI model generates results, while a global model manages and harmonizes the final output across all patches.

This methodology addresses the scalability challenges prevalent in today's AI models. The Megabyte model's patch system permits a single feedforward network to operate on a patch encompassing multiple tokens. Researchers found that this patch approach effectively counters the issue of self-attention scaling.
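
A back-of-the-envelope count makes the scaling argument concrete: full self-attention touches roughly T² position pairs, while a patch-based split pays for one global model over T/P patch representations plus local attention inside each patch. The sketch below ignores constants and Megabyte's exact formulation; the numbers are illustrative only.

```python
# Back-of-the-envelope count of pairwise attention interactions.
# Ignores constants and Megabyte's exact formulation; numbers are illustrative.
def full_attention_cost(seq_len: int) -> int:
    return seq_len ** 2                         # every position attends to every position

def patched_attention_cost(seq_len: int, patch_size: int) -> int:
    num_patches = seq_len // patch_size
    global_cost = num_patches ** 2              # global model over patch embeddings
    local_cost = num_patches * patch_size ** 2  # local model inside each patch
    return global_cost + local_cost

T, P = 8192, 8
full, patched = full_attention_cost(T), patched_attention_cost(T, P)
print(f"full: {full:,}  patched: {patched:,}  ratio: {full / patched:.0f}x")
# full: 67,108,864  patched: 1,114,112  ratio: 60x
```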

The patch model enables Megabyte to perform calculations in parallel, a stark contrast to traditional Transformers, which perform computations serially. This yields significant efficiencies even when the base model has more parameters: experiments indicated that Megabyte, utilizing a 1.5B parameter model, could generate sequences 40% faster than a Transformer model with 350M parameters.

Using several tests to determine the limits of this approach, researchers discovered that the Megabyte model's maximum capacity exceeded 1.2M tokens. For comparison, OpenAI's GPT-4 has a limit of 32,000 tokens, while Anthropic's Claude has a limit of 100,000 tokens.

Shaping the AI Future


As the AI arms race progresses, AI model enhancements largely stem from training on an ever-growing number of parameters, which are the values learned during an AI model's training phase. While GPT-3.5 was trained on 175B parameters, there's speculation that the more capable GPT-4 was trained on 1 trillion parameters.

OpenAI CEO Sam Altman recently suggested a shift in strategy, confirming that the company is thinking beyond training colossal models and is zeroing in on other optimizations. He equated the future of AI models to iPhone chips, where the majority of consumers are oblivious to the raw technical specifications. Altman envisioned a similar future for AI, emphasizing the continual increase in capability.

Meta’s researchers believe their innovative architecture arrives at an opportune time, but they also acknowledge that there are other pathways to optimization. Promising research areas, such as more efficient encoder models adopting patching techniques, decoder models breaking down sequences into smaller blocks, and preprocessing sequences into compressed tokens, are on the horizon and could extend the capabilities of the existing Transformer architecture for a new generation of models.

Nonetheless, Meta’s recent research has AI experts excited. Andrej Karpathy, the former Sr. Director of AI at Tesla and now a lead AI engineer at OpenAI, chimed in as well on the paper. This is “promising,” he wrote on Twitter. “Everyone should hope that we can throw away tokenization in LLMs. Doing so naively creates (byte-level) sequences that are too long.”
 