1/11
After a recent price reduction by OpenAI, GPT-4o tokens now cost $4 per million tokens (using a blended rate that assumes 80% input and 20% output tokens). GPT-4 cost $36 per million tokens at its initial release in March 2023. This price reduction over 17 months corresponds to about a 79% drop in price per year. (4/36 = (1 - p)^{17/12})
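To make the arithmetic explicit, here's a quick sanity check of that figure (a minimal sketch in Python; the prices and the 17-month window come from the paragraph above):

```python
# Annualized price decline: price_now = price_then * (1 - p) ** (months / 12)
# Solving for p: p = 1 - (price_now / price_then) ** (12 / months)

price_then = 36.0  # $/1M tokens: GPT-4 at launch, March 2023
price_now = 4.0    # $/1M tokens: GPT-4o blended rate (80% input, 20% output)
months = 17

p = 1 - (price_now / price_then) ** (12 / months)
print(f"Annualized price drop: {p:.0%}")  # Annualized price drop: 79%
```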
As you can see, token prices are falling rapidly! One force driving prices down is the release of open-weight models such as Llama 3.1. If API providers, including startups such as Anyscale, Fireworks, and Together AI, as well as some large cloud companies, don't have to worry about recouping the cost of developing a model, they can compete directly on price and a few other factors such as speed.
Further, hardware innovations by companies such as Groq (a leading player in fast token generation), SambaNova (which serves Llama 3.1 405B tokens at an impressive 114 tokens per second), and wafer-scale computation startup Cerebras (which just announced a new offering this week), as well as the semiconductor giants NVIDIA, AMD, Intel, and Qualcomm, will drive further price cuts.
When building applications, I find it useful to design to where the technology is going rather than only where it has been. Based on the technology roadmaps of multiple software and hardware companies — which include improved semiconductors, smaller models, and algorithmic innovation in inference architectures — I’m confident that token prices will continue to fall rapidly.
This means that even if you build an agentic workload that isn't entirely economical, falling token prices might make it economical at some point. As I wrote previously, being able to process many tokens is particularly important for agentic workloads, which must call a model many times before generating a result. Further, even agentic workloads are already quite affordable for many applications. Let's say you build an application to assist a human worker, and it uses 100 tokens per second continuously: at $4 per million tokens, you'd be spending only $1.44 per hour, which is significantly lower than the minimum wage in the U.S. and many other countries.
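That hourly figure is easy to verify (same assumptions as above: 100 tokens per second at the $4-per-million blended rate):

```python
tokens_per_second = 100
dollars_per_million_tokens = 4.00  # blended input/output rate

cost_per_hour = tokens_per_second * 3600 / 1_000_000 * dollars_per_million_tokens
print(f"${cost_per_hour:.2f}/hour")  # $1.44/hour
```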
So how can AI companies prepare?
- First, I continue to hear from teams that are surprised to find out how cheap LLM usage is when they actually work through cost calculations. For many applications, it isn’t worth too much effort to optimize the cost. So first and foremost, I advise teams to focus on building a useful application rather than on optimizing LLM costs.
- Second, even if an application is marginally too expensive to run today, it may be worth deploying in anticipation of lower prices.
- Finally, as new models get released, it might be worthwhile to periodically examine an application to decide whether to switch to a new model either from the same provider (such as switching from GPT-4 to the latest GPT-4o-2024-08-06) or a different provider, to take advantage of falling prices and/or increased capabilities.
Because multiple providers now host Llama 3.1 and other open-weight models, if you use one of these models, it might be possible to switch between providers without too much testing (though implementation details, particularly quantization, mean that different offerings of the same model can differ in performance). When switching between models, unfortunately, a major barrier is still the difficulty of implementing evals, so carrying out regression testing to make sure your application will still perform well after you swap in a new model can be challenging. However, as the science of carrying out evals improves, I'm optimistic that this will become easier.
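To make that kind of regression testing concrete, here is a minimal sketch. The `call_model` function and the toy eval cases are hypothetical placeholders; a real eval suite would use your application's actual prompts and graded rubrics or an LLM judge rather than exact substring matching.

```python
# Minimal regression-check sketch for deciding whether to swap in a new model.
# call_model() is a hypothetical stand-in for your provider's API client.

def call_model(model: str, prompt: str) -> str:
    """Hypothetical wrapper around a chat-completion API; replace with a real client."""
    raise NotImplementedError

# Toy eval set of (prompt, expected substring) pairs -- purely illustrative.
EVAL_CASES = [
    ("What is 17 * 12?", "204"),
    ("Name the capital of France.", "Paris"),
]

def pass_rate(model: str) -> float:
    """Fraction of eval cases whose output contains the expected substring."""
    passed = sum(expected in call_model(model, prompt)
                 for prompt, expected in EVAL_CASES)
    return passed / len(EVAL_CASES)

def safe_to_switch(current: str, candidate: str, tolerance: float = 0.02) -> bool:
    """Approve the swap only if the candidate's pass rate is within tolerance."""
    return pass_rate(candidate) >= pass_rate(current) - tolerance
```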
[Original text (with links): AI Restores ALS Patient's Voice, AI Lobby Grows, and more ]
2/11
Why are we considering 4 and 4o to be the same tokens, though, if they aren't?
3/11
Let's hope SB-1047 proponents realize that open-source is already vital for customers and to avoid price gouging.
4/11
OpenAI gets a lot of flak for announcing but not releasing their innovations - which is most likely not their fault, BTW. But they have given us GPT-4o mini with amazing price-performance. I'm not sure most people realize how awesome it is!
5/11
As AI models become more affordable, it's a great time to explore new possibilities and build innovative applications without worrying too much about costs.
6/11
Prices will continue to go down. LLMs will rapidly become commodities. The value will be created at the application level.
Why Large Language Models Are A Commodity Now And What It Means For The AI Space
7/11
The key lesson from your post is to work on applying LLMs to a use case and, for the time being, accept the high cost.
8/11
Mark Zuckerberg is the best.
9/11
The most important factor is that model sizes have really declined: GPT-4 is an 1,800B MoE, while GPT-4o is maybe 100B, I guess. The second is inference optimization, through various means like quantization, batching, and caching. Hardware prices don't decline that fast.
10/11
With easier access to models, there might be a push for more transparency in how decisions are made, affecting both model selection and application design.
11/11
Another confirmation that using @LangChainAI, although a pretty crufty library, is a good move.