August 31, 2023
Comparing Code Llama Models Locally
Srini Kadamati
Trying out new LLMs can be cumbersome. Two of the biggest challenges are:
- Disk space: there are many different variants of each LLM and downloading all of them to your laptop or desktop can use up 500-1000 GB of disk space easily.
- No access to an NVIDIA GPU: most people don’t have an NVIDIA GPU lying around, but modern laptops (like the M1 and M2 MacBooks) have surprisingly good graphics capabilities.
In this post, we’ll showcase how you can stream individual model files on demand (which eases the disk space burden) and how you can use quantized models that run on your local machine’s graphics hardware (which addresses the second challenge).
We wrote this post with owners of Apple Silicon pro computers in mind (e.g. an M1 / M2 MacBook Pro or Mac Studio), but you can modify a single instruction (the llama.cpp compilation instruction) to try this on other platforms.
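For reference, here’s roughly what that single change looks like. The Metal flag is the one we use on Apple Silicon; the cuBLAS flag is the kind of substitution you’d make on a Linux box with an NVIDIA GPU. Exact build flags can vary with your llama.cpp version, so treat these as a sketch and check the llama.cpp README for your checkout:
# On an Apple Silicon Mac: build llama.cpp with Metal GPU offload enabled
LLAMA_METAL=1 make
# On a Linux machine with an NVIDIA GPU, you'd swap in the cuBLAS flag instead:
# make LLAMA_CUBLAS=1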
Before we dive in, we’re thankful to TheBloke (Tom Jobbins) for quantizing the models themselves, to the llama.cpp community, and to Meta for making it possible to try these models locally with just a few commands.
Llama 2 vs Code Llama
As a follow-up to Llama 2, Meta recently released a specialized set of models named Code Llama. These models have been trained on code-specific datasets for better performance on coding assistance tasks. According to a slew of benchmarks, the Code Llama models perform better than regular Llama 2:
Code Llama was also trained to provide stable generation with up to 100,000 tokens of context. This enables some pretty unique use cases.
- For example, you could feed a stack trace along with your entire code base into Code Llama to help you diagnose the error.
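As a rough sketch of what that could look like with llama.cpp (the model file name, prompt file, and 16K context value here are illustrative assumptions; how far you can push the context depends on the model variant and how much memory your machine has):
# hypothetical prompt.txt: your stack trace pasted above the relevant source files
./main -m codellama-7b-instruct.Q4_K_M.gguf -c 16384 -f prompt.txt -n 512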
The Many Flavors of Code Llama
Code Llama has 3 main flavors of models:
- Code Llama (vanilla): fine-tuned from Llama 2 for language-agnostic coding tasks
- Code Llama - Python: further fine-tuned on 100B tokens of Python code
- Code Llama - Instruct: further fine-tuned to generate helpful (and safe) answers in natural language
For each of these models, different versions have been trained with varying parameter counts to accommodate different compute and latency requirements:
- 7 billion (or 7B for short): can be served on a single NVIDIA GPU (without quantization) and has lower latency
- 13 billion (or 13B for short): more accurate but a heavier GPU is needed
- 34 billion (or 34B for short): slower but higher performing, with the heaviest GPU requirements
For example, the Code Llama - Python variant with 7 billion parameters is referenced as CodeLlama-7b-Python across this post and around the web. Also, here's Meta’s diagram comparing the model training approaches:
Model Quantization
To take advantage of XetHub’s ability to mount model files to your local machine, the files need to be hosted on XetHub. To run the models locally, we’ll be using the XetHub mirror of the Code Llama models quantized by TheBloke (aka Tom Jobbins). You'll notice that data added to XetHub also gets deduplicated, which reduces the repo size.
Tom has published models for each combination of model type and parameter count. For example, here’s the Hugging Face repo for CodeLlama-7B-GGUF. You’ll notice that each model type has multiple quantization options:
The CodeLlama-7B model alone has 10 different quantization variants. Generally speaking, the more bits used in the quantization process (8-bit vs. 2-bit, for example), the more memory is needed (either standard RAM or GPU RAM), but the higher the output quality.
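As a rough guide to reading those variant names (the file names follow TheBloke's GGUF naming convention; the comments are rules of thumb rather than measured numbers):
codellama-7b.Q2_K.gguf    # 2-bit quantization: smallest download and memory footprint, lowest quality
codellama-7b.Q4_K_M.gguf  # 4-bit "K-quant", medium variant: a common balance of size and quality
codellama-7b.Q8_0.gguf    # 8-bit quantization: largest of the quantized files, closest to the original weights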
GGML vs GGUF
The llama.cpp community initially used the .ggml file format to represent quantized model weights but has since moved on to the .gguf file format. There are a number of reasons and benefits to the switch, but two of the most important are:
- Better future-proofing, including support for non-llama models in llama.cpp such as Falcon
- Better performance
Prerequisites
In an earlier post, I cover how to run the Llama 2 models on your MacBook. That post covers the prerequisites you need to run any ML model hosted on XetHub. Follow steps 0 to 3 and then come back to this post. Also make sure you’ve signed the license agreement from Meta and that you aren’t violating their community license.
Once you’re set up with PyXet and XetHub and you’ve compiled llama.cpp for your laptop, run the following command to mount the XetHub/codellama repo to your local machine:
xet mount --prefetch 32 xet://XetHub/codellama
This should finish in just a few seconds because the model files aren’t actually downloaded to your machine up front. As a reminder, the XetHub repo for these models lives at this link.
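From there, you can point llama.cpp straight at a model file inside the mount. Here’s a minimal sketch, assuming the repo mounted to a local codellama directory and that the .gguf path below matches what you actually see after mounting (adjust both to your setup); the -ngl 1 flag offloads work to the Mac’s GPU via Metal:
# hypothetical path: browse the mounted directory to find the exact .gguf file you want
./main -m codellama/CodeLlama-7B-GGUF/codellama-7b.Q4_K_M.gguf -ngl 1 -n 256 -p "Write a Python function that checks whether a number is prime."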