Compared the leaked Llama 3.1 benchmarks with other leading models. Very excited for the release!
We can tier models by price per 1M output tokens.
O($0.10): 4o-mini and <10B param models. I think 4o-mini will still be best but a strong local 8B will unlock lots of applications.
O($1): 70B-class models, Haiku, GPT-3.5 Turbo. The distilled 70B looks like a category winner here! This is a nice price point for lots of AI apps.
O($10): 405B, 4o, 3.5 Sonnet. We'll have to see how post-training and the harder benchmarks go. My guess is that 3.5 Sonnet is still the best, but 405B might give 4o real competition. This is just vibes: I like Sonnet's RLHF and hate the GPT RLHF.
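A minimal sketch of the order-of-magnitude bucketing used above. The rule of rounding log10 of the price to the nearest integer is an assumption for illustration (the tiers in this post were assigned by feel and are looser than that), and `price_tier` is a hypothetical helper name.

```python
import math


def price_tier(usd_per_million_output_tokens: float) -> str:
    """Label a price with its nearest power-of-ten tier, e.g. 'O($1)'.

    Assumption: a tier is the nearest power of ten on a log scale.
    """
    if usd_per_million_output_tokens <= 0:
        raise ValueError("price must be positive")
    # Round the base-10 exponent to the nearest integer.
    exponent = round(math.log10(usd_per_million_output_tokens))
    tier = 10.0 ** exponent
    # Show cents for sub-dollar tiers, whole dollars otherwise.
    return f"O(${tier:.2f})" if tier < 1 else f"O(${tier:.0f})"


# Illustrative prices only, not actual vendor pricing.
for price in (0.15, 1.20, 15.00):
    print(price, "->", price_tier(price))
```

Under this strict rule, prices near a boundary (around $0.30 or $3) can land in a different tier than a hand grouping would put them.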
Other takeaways:
- most benchmarks are saturated and probably not informative. OpenAI only reports harder benchmarks now, and other developers should too (e.g. MATH rather than GSM8K)
- 405B doesn't look much better than distilled 70B, but harder benches and vibe tests will be better measurements than these tests
- the distilled 8B/70B models are substantially better than the versions trained from scratch. I've wondered whether, for a given compute budget, it is better to overtrain an X-param model or to train an X'-param model (where X' >> X) and distill it down to X. maybe we will find out
- a lot of people thought that the original 8B saturated the params after 15T tokens. this is good evidence that it did not. softmax with high token count may have been why it did not quantize well. curious if the Llama 4 series will train in FP8 or BF16 -- logistically, serving 400B on one 8xH100 node seems much easier than on two, and this is much simpler to do if the model was pretrained quantization-aware
- Gemma models do surprisingly well on MMLU relative to their other scores. most of the innovation in Gemma was supposed to be post-training, so curious if the 27B will hold up vs new 8B-Instruct
- Mistral Nemo 12B and 3.1 8B look somewhat similar, but I'd guess most developers will stick to Llama's better tooling and smaller param count. tough timing
- I am fairly sure that 3.1 was not trained with early fusion, and somebody's going to throw up a LLaVA finetune in 2-3 days.
- personal guess (using other info) is that 405B-Instruct will fall short of Sonnet / 4o. but man, what a good model to have open source, and the gap is closing
- llama3.1405 looks like a Pi joke
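On the one-node vs two-node serving point, here is a back-of-the-envelope sketch of the weight memory alone. It assumes 80 GB per H100, 2 bytes per parameter for BF16 and 1 byte for FP8, and ignores KV cache, activations, and runtime overhead, so real headroom is tighter than this suggests. The function names are hypothetical.

```python
import math


def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (1e9 params * bytes / 1e9)."""
    return params_billions * bytes_per_param


def nodes_needed(params_billions: float, bytes_per_param: float,
                 gpus_per_node: int = 8, gb_per_gpu: int = 80) -> int:
    """Minimum node count that fits the weights (assumes 80 GB H100s)."""
    node_capacity_gb = gpus_per_node * gb_per_gpu  # 640 GB per 8xH100 node
    needed_gb = weight_memory_gb(params_billions, bytes_per_param)
    return math.ceil(needed_gb / node_capacity_gb)


for name, nbytes in (("BF16", 2), ("FP8", 1)):
    print(name, weight_memory_gb(405, nbytes), "GB ->",
          nodes_needed(405, nbytes), "node(s)")
```

With these assumptions, BF16 weights for 405B come to 810 GB, which does not fit in one node's 640 GB, while FP8 weights come to 405 GB, which does.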
all models are base except 4o-class. I took the best available score from different repos and averaged MMLU for Llama. all benchmarks are wrong but hopefully useful for an overall picture.