REVEALED: Open A.I. Staff Warn "The progress made on Project Q* has the potential to endanger humanity" (REUTERS)

bnew

Veteran
Joined
Nov 1, 2015
Messages
58,206
Reputation
8,623
Daps
161,867

1/1
New Paper and Blog!

Sakana AI

As LLMs become better at generating hypotheses and code, a fascinating possibility emerges: using AI to advance AI itself! As a first step, we got LLMs to discover better algorithms for training LLMs that align with human preferences.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196




1/1
Can LLMs invent better ways to train LLMs?

At Sakana AI, we’re pioneering AI-driven methods to automate AI research and discovery. We’re excited to release DiscoPOP: a new SOTA preference optimization algorithm that was discovered and written by an LLM!

Sakana AI

Our method leverages LLMs to propose and implement new preference optimization algorithms. We then train models with those algorithms and evaluate their performance, providing feedback to the LLM. By repeating this process for multiple generations in an evolutionary loop, the LLM discovers many highly-performant and novel preference optimization objectives!

Paper: [2406.08414] Discovering Preference Optimization Algorithms with and for Large Language Models
GitHub: GitHub - SakanaAI/DiscoPOP: Code for Discovering Preference Optimization Algorithms with and for Large Language Models
Model: SakanaAI/DiscoPOP-zephyr-7b-gemma · Hugging Face

We proudly collaborated with the @UniOfOxford (@FLAIR_Ox) and @Cambridge_Uni (@MihaelaVDS) on this groundbreaking project. Looking ahead, we envision a future where AI-driven research reduces the need for extensive human intervention and computational resources. This will accelerate scientific discoveries and innovation, pushing the boundaries of what AI can achieve.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 
Last edited:

bnew

Veteran
Joined
Nov 1, 2015
Messages
58,206
Reputation
8,623
Daps
161,867





1/11
This paper seems very interesting: say you train an LLM to play chess using only transcripts of games of players up to 1000 elo. Is it possible that the model plays better than 1000 elo? (i.e. "transcends" the training data performance?). It seems you get something from nothing, and some information theory arguments that this should be impossible were discussed in conversations I had in the past. But this paper shows this can happen: training on 1000 elo game transcripts and getting an LLM that plays at 1500! Further the authors connect to a clean theoretical framework for why: it's ensembling weak learners, where you get "something from nothing" by averaging the independent mistakes of multiple models. The paper argued that you need enough data diversity and careful temperature sampling for the transcendence to occur. I had been thinking along the same lines but didn't think of using chess as a clean measurable way to scientifically measure this. Fantastic work that I'll read I'll more depth.

2/11
[2406.11741v1] Transcendence: Generative Models Can Outperform The Experts That Train Them paper is here. @ShamKakade6 @nsaphra please tell me if I have any misconceptions.

3/11
In the classic "Human Decisions and Machine Predictions" paper Kleinberg et al. give evidence that a predictor learned from the bail decisions of multiple judges does better than the judges themselves, calling it a wisdom of the crowd effect. This could be a similar phenomena

4/11
Yes that is what the authors formalize. Only works when there is diversity in the weak learners ie they make different types of mistakes independently.

5/11
It seems very straightforward: a 1000 ELO player makes good and bad moves that average to 1000. A learning process is a max of the quality of moves, so you should get a higher than 1000 rating. I wonder if the AI is more consistent in making "1500 ELO" moves than players.

6/11
Any argument that says it's not surprising must also explain why it didn't happen at 1500 elo training, or why it doesn't happen at higher temperatures.

7/11
The idea might be easier to understand for something that’s more of a motor skill like archery. Imagine building a dataset of humans shooting arrows at targets and then imitating only the examples where they hit the targets.

8/11
Yes but they never have added information on what a better move is or who won, as far as I understood. Unclear if the LLM is even trying to win.

9/11
Interesting - is majority vote by a group of weak learners a form of “verification” as I describe in this thread?

10/11
I don't think it's verification, ie they didn't use signal of who won in each game. It's clear you can use that to filter only better (winning) player transcripts , train on that, iterate to get stronger elo transcripts and repeat. But this phenomenon is different, I think. It's ensembling weak learners. The cleanest setting to understand ensembling: Imagine if I have 1000 binary classifiers, each correct with 60 percent probability and *Independent*. If I make a new classifier by taking majority, it will perform much better than 60 percent. It's concentration of measure, the key tool for information theory too. The surprising experimental findings are 1. this happens with elo 1000 chess players where I wouldn't think they make independent mistakes. 2. Training on transcripts seems to behave like averaging weak learners.

11/11
Interesting


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GQaWL4zbgAANwp0.png

GQcLT1cXoAAwYmG.jpg

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
58,206
Reputation
8,623
Daps
161,867



1/11
powerful, fast, or safe? pick three.

2/11
Alright that was pretty sweet

3/11
let's, as they say, go

4/11
damn I might switch to it full time, it has vision too right?

5/11
SOTA:

6/11
the demos have you written all over them ❣️ love how much fun yall are clearly having

7/11
couldn't collab more beautifully than with the inimitable @whitneychn

8/11
Nice chart. Competitive markets truly accelerates innovation!

9/11
What's up, @sammcallister ? Can I dm you and show you how I got Claude 3.5 Sonnet to 1 shot solve word problems that it previously couldn't?

10/11
great graphic

11/11
I take the one with new features to test 👀 Claude 3.5 fits there as well! Loads of small nice details, like code revisions over here 🔥


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GQhbBWhXoAEtUfF.png

GQhW2Y6WgAABTGV.png

GQhlFfNWMAA_qpR.jpg

GQiJVeFWIAASaQL.jpg

GQhpe6PXAAEuooI.jpg

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
58,206
Reputation
8,623
Daps
161,867

1/9
Unpopular opinion: We will not achieve AGI any time soon and @leopoldasch's prediction is way off.

Here's why:

1 - The Straight Line Fallacy
One of the most common mistakes in predicting technological advancements is assuming that progress will continue in a straight line. This belief doesn’t account for the possibility of an "AI winter," a period where advancements slow down or stop due to unexpected technical challenges or diminishing returns.

History shows us that progress doesn’t happen in straight lines. Look at the transition from horses to cars. It wasn’t a smooth, continuous progression. There were bursts of rapid development followed by long periods where nothing much happened. The same can happen with AI. In physics, we've seen significant slowdowns despite continuous efforts.

Similarly, in medicine, the fight against cancer has not progressed in a straight line; breakthroughs are followed by long periods of incremental progress or even setbacks.

2 - The AGI Definition Problem
First, we don’t even have a clear definition of AGI.

Leopold’s statements highlight this confusion: On one hand, he says it’s "strikingly plausible that by 2027, models will be able to do the work of an AI researcher/engineer." On the other hand, he claims, "We are on course for AGI by 2027."

These are vastly different capabilities. The lack of a clear definition muddles the discussion.

3 - The Compute Problem
Even if we assume that all we need to achieve AGI is more compute power, we face significant hurdles. The production of GPUs and other essential components is already hitting resource limits.

The environmental impact of the massive energy consumption required for training these models could lead to regulatory restrictions. So, even if more compute is all we need, who’s to say we’ll have enough of it?

4 - The Limits of Current AI
@ylecun often points out that current LLMs, despite their impressive capabilities, are fundamentally limited.

They don’t understand the real world, they often generate incorrect information, and they can’t plan or reason beyond their training data. These are not minor issues—they’re fundamental gaps that more data and compute power alone won’t fix.

Final Point: The Influence of Corporate Hype
There’s another, less talked about aspect of the AGI debate: corporate hype. The talk of AGI has been hijacked by corporations looking to benefit from millions of $ of free marketing the term generates.

Don't fall for the hype.
---
Let me be clear that this is not an attack on Leopold (awesome guy) but my personal opinion on the subject.

2/9
Is there a plausible path to AI becoming smarter than humans while only being able to train on information generated by humans or derivatives synthesized from that information?

3/9
My problem with that graph is assumption that ChatGPT-4.0 is a "smart high schooler". It's hilarious knowing all the scams that had to be manually added to the training data so that chatgpt wouldn't get fooled.

4/9
Yann Le Cun has also been wrong in the past about the capabilities of GPT-n+1 models, particularly regarding the understanding of the physical world or certain reasoning/planning capacities.

What guarantees us that larger models cannot solve some of its problems ?

5/9
> 2 - The AGI Definition Problem

I think the fuzziness of the AGI definition also is the main argument in favour of his viewpoint.
But yes, this fuzziness makes statements like this somewhat pointless.

6/9
The people who don't know what consciousness is predicting conscious computers is the biggest scientific joke of 21st century.
Second biggest is people believing time-travel is possible.

7/9
It's not even a straight line. It's a log plot, so the line looks straight, but it's really exponential. Expecting the exponential growth to continue indefinitely is just silly.

8/9
Number one should already be enough, but people love fooling themselves. It will not be a straight line, period

9/9
There’s a fundamental underlying issue here: we cannot define or effectively measure human intelligence. We can’t even define human consciousness. Thinking we can invent it in a couple decades is a staggering overestimation of our own understanding of ourselves.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GPQFkaUaQAE08JB.jpg




1/10
AGI by 2027 is strikingly plausible.

That doesn’t require believing in sci-fi; it just requires believing in straight lines on a graph.

2/10
From my new series. I go through rapid scaleups in compute, consistent trends in algorithmic progress, and straightforward ways models can be "unhobbled" (chatbot -> agent) on the path to a "drop-in remote worker" by 2027.
I. From GPT-4 to AGI: Counting the OOMs - SITUATIONAL AWARENESS

3/10
But what does AGI mean? It has lost its meaning in the past two or three years completely

4/10
> straight lines
> exponential chart

My guy

5/10
Straight lines on a log graph

6/10
Most plausible:

7/10


8/10
"...it just requires believing in straight lines on a graph."

9/10
.. It requires believing GPT-4 is as smart as a smart high schooler which is an absolute fantasy.

10/10
Believing in straight lines on graphs in log scale is sci-fi or for bitcoin believers


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GPQFkaUaQAE08JB.jpg

GPQOMf7bcAAeNrj.png

GPRm13naAAA3plE.png

F4JrE3UWcAAaY8J.jpg

F4JrGcHXsAAGGpQ.jpg

GPE6EK6WgAAxc5P.jpg
 
Last edited:

bnew

Veteran
Joined
Nov 1, 2015
Messages
58,206
Reputation
8,623
Daps
161,867

[2407.04153] Mixture of A Million Experts
[Submitted on 4 Jul 2024]

Mixture of A Million Experts​

Xu Owen He
Abstract:
The feedforward (FFW) layers in standard transformer architectures incur a linear increase in computational costs and activation memory as the hidden layer width grows. Sparse mixture-of-experts (MoE) architectures have emerged as a viable approach to address this issue by decoupling model size from computational cost. The recent discovery of the fine-grained MoE scaling law shows that higher granularity leads to better performance. However, existing MoE models are limited to a small number of experts due to computational and optimization challenges. This paper introduces PEER (parameter efficient expert retrieval), a novel layer design that utilizes the product key technique for sparse retrieval from a vast pool of tiny experts (over a million). Experiments on language modeling tasks demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of performance-compute trade-off. By enabling efficient utilization of a massive number of experts, PEER unlocks the potential for further scaling of transformer models while maintaining computational efficiency.


Subjects:Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:arXiv:2407.04153
arXiv:2407.04153v1

Submission history​

From: [v1] [ view email]
[v1]

https://arxiv.org/pdf/2407.04153




A.I Generated explanation:


This is a good start! Here's a slightly more detailed and technically accurate explanation:

Title: **Efficiently using a large number of experts in a Transformer model**

This paper describes a new technique for making Transformer models more efficient.



Here's the problem in more detail:


  • * Transformers are powerful AI models, especially for understanding and generating language. They do this by analyzing the relationships between words in a sentence, no matter where they are in the sentence. This is great for understanding complex language, but the traditional way of building Transformers means they need a lot of computing power to work well.

  • * The researchers are trying to address this by creating a more efficient way to use the "experts" (think of them as specialized modules within the model) to improve performance without increasing the size of the model itself.


The problem the researchers are tackling:


* **Computational cost:**

Training large, powerful AI models is expensive, requiring a lot of computational resources and time.

* **Efficiency:**

The researchers want to make these models more efficient without sacrificing their ability to learn.


The solution they propose:


* PEER (Parameter-Efficient Expert Retrieval) is a method that addresses the scaling problem of Transformers by using a technique called "sparse mixture-of-experts". This means instead of having one large model, they use many smaller models, each specializing in a specific aspect of the language.

* Think of it like this: Imagine you have a team of experts, each with a limited area of expertise. Instead of letting them all work on every problem, which would be inefficient, you only select the experts who are relevant to the task at hand. This is what the researchers are aiming for with this new method.

The key is to use "expert routing"

* The researchers want to use a large number of experts (smaller models) to make the model more powerful, but they also want to make sure the model is still efficient.

* The paper proposes a way to make the model more powerful by increasing the number of experts, but it's not clear from this excerpt how they achieve that efficiency.


Possible ways to achieve efficiency in PEER:


* Sparsely activating only a subset of experts:

This means that not all of the experts are used for every input, only the ones that are most relevant.

* Efficient routing mechanisms:

The paper likely proposes a specific method for determining which experts are activated for a given input.

* Efficient training techniques:

The excerpt mentions the paper will likely discuss this, but it's not clear what specific techniques are used.

To understand the full solution, you'd need to read the paper, but the key takeaway is that they're proposing a way to improve the efficiency of AI models by making them more modular and scalable.







1/1
The Mixture of a Million Experts paper is a straight banger.

Reduces inference cost and memory usage, scales to millions of experts, oh and just happens to overcome catastrophic forgetting and enable life long learning for the model.

Previous MOE models never got past 10k experts and they had a static router to connect them up that was inefficient but this includes a learned router than can handle millions of micro experts. Reminds me a bit of how the neocortex works because it is composed of about 2 million cortical columns that can each learn a model of the world and then work together to form a collective picture of reality.

Catastrophic forgetting and continual learning are two of the most important and nasty problems with current architectures and this approach just potentially wiped out both in one shot.

There have been other approaches to try to enable continual learning and overcome catastrophic forgetting like bi-level continual learning or progress and compress, that use elastic weight consolidation, knowledge distillation and two models, a big neural net and a small learning net. The small net learns and over time the learnings are passed back to the big net. Its weights are partially frozen and consolidated as the new knowledge is brought in. Good ideas, also out of Deep Mind robotics teams.

But this paper seems to say you can just add in new mini experts, freeze or partially freeze old weights, and just grow the model's understanding as much as you want, without causing it to lose what it already knows.

It's like having Loras built right into the model itself.

[2407.04153] Mixture of A Million Experts


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GSjq38yXcAAeUbd.jpg
 

Arizax2

All Star
Joined
Mar 20, 2017
Messages
3,396
Reputation
333
Daps
10,879
Whole lot of unemployed folks in the near future. But we have Biden and Trump in their 80s as our choice when this AI shyt is about to explode in the next 3-4 years. I'm telling you now that these two probably don't even know the difference between a Mac and PC OS. :beli:
 
Top