bnew

Veteran
Joined
Nov 1, 2015
Messages
56,112
Reputation
8,239
Daps
157,800


1/10
@omarsar0
Thinking LLMs

How difficult is it to train LLMs to do explicit "thinking" before responding to questions or tasks?

This work proposes a training method to equip LLMs with thinking abilities for general instruction-following without human-annotated data.

It uses an iterative search and optimization procedure to explore thought generation which enables the model to learn without direct supervision.

Thought candidates for each user instruction are scored with a judge model. Note that only the responses (not the thoughts) are evaluated by the judge, which determines the best and worst ones.

Then the corresponding full outputs are used as chosen and rejected pairs for DPO (referred to as Thought Preference Optimization in this paper). The full training process involves multiple iterations of this procedure.
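
As a rough sketch (not the paper's code), here is how such preference pairs might be assembled, assuming a hypothetical `judge_score` stand-in for the judge model:

```python
from dataclasses import dataclass
import random

@dataclass
class Candidate:
    thought: str    # hidden reasoning generated before the answer
    response: str   # user-visible answer; only this part is judged

def judge_score(instruction: str, response: str) -> float:
    """Stand-in for the judge model; returns a random score purely for illustration."""
    return random.random()

def build_preference_pair(instruction: str, candidates: list):
    # Score each candidate on its response only, never on its thought.
    scored = sorted(candidates, key=lambda c: judge_score(instruction, c.response))
    worst, best = scored[0], scored[-1]
    # The chosen/rejected pair keeps the full output (thought + response),
    # so the thoughts are optimized indirectly through response quality.
    return {
        "prompt": instruction,
        "chosen": best.thought + "\n" + best.response,
        "rejected": worst.thought + "\n" + worst.response,
    }
```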

Overall, this is a simple yet very effective approach to incentivizing the model to generate its own thoughts without explicitly teaching it how to think. The authors also find that these Thinking LLMs are effective even in problems that often don't rely on reasoning or CoT methods.



GZ8eZQ8b0AIjtkp.jpg


2/10
@omarsar0
Paper: [2410.10630] Thinking LLMs: General Instruction Following with Thought Generation



3/10
@ninataft
AI's new thinking cap: Now with less coffee, more thought bubbles, and zero existential crises! 🧠😆



4/10
@BensenHsu
The paper focuses on training large language models (LLMs) to have the ability to "think" before providing a response, similar to how humans think before answering a complex question. This is called "Thinking LLMs". The authors argue that thinking can be beneficial for a wide range of tasks, not just logical or math-related ones.

The results show that the initial seed model performs poorly when asked to generate thoughts, compared to the direct response baseline. However, after multiple iterations of TPO training, the Thinking LLM outperforms the direct response baseline on general instruction-following benchmarks like AlpacaEval and Arena-Hard. The authors also find that thinking benefits not only reasoning and problem-solving tasks, but also categories like marketing, health, and general knowledge.

full paper: THINKING LLMS: GENERAL INSTRUCTION FOLLOWING WITH THOUGHT GENERATION



GZ8vii0aYAALKlu.jpg


5/10
@LaurenceBrem
From the abstract, it still feels like inference is happening in some form before responding. Still impressive!



6/10
@bytetweets
It’s not thinking, it is reasoning.



7/10
@Anonymo74027124
Interesting concept. Curious to see the results of this training method. Will keep an eye on its progress.



8/10
@jackshiels
It’s actually very interesting how so many researchers missed this. The need for compute probably left many with an implicit assumption that ‘compute is bad’, meaning it was overlooked.



9/10
@bate5a55
Interesting that the 'Judge' component evaluates only the responses, allowing the LLM to generate unfiltered thoughts. This might enable emergent metacognition, a capability rarely seen in AI models.



10/10
@gpt_biz
This sounds like an exciting development in AI research! It's impressive to see models being trained to think more independently without needing human annotations. Looking forward to the future possibilities!




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,112
Reputation
8,239
Daps
157,800


1/9
@omarsar0
This is the first report I've seen where out-of-the-box o1 CoT reasoning is significantly outperformed on code generation.

They propose AlphaCodium which acts as a strategy provider (i.e., flow engineering) on top of o1 to fuel and guide o1's chain-of-thought capabilities.

AlphaCodium is built as a multi-stage flow improving on reasoning and reliability, while significantly elevating o1-preview's performance (55% --> 78% pass@5 accuracy on the Codeforces benchmark).

Author's quote: "The promise of AlphaCodium is clear: with the right strategic flow-engineering, foundational models like o1 can be nudged toward System II thinking. We still have work to do to cross the gap from “System 1.5” to a true System 2-level AI, but when observing tools like AlphaCodium, we can better understand the gap and keep researching to get closer."

o1 is a significantly better model than most LLMs out there today. However, it can benefit from strategic guidance as shown in this report. I am also working on a similar approach for knowledge-intensive tasks. From my own analysis, it does feel like o1 has better knowledge of complex tasks but still shows limitations in complex knowledge understanding and reasoning. More on this soon.



GZ3Ysikb0AIXfPf.jpg


2/9
@omarsar0
Details here: AlphaCodium Outperforms Direct Prompting of o1 Model



3/9
@kubertai
Hmm, I'm testing it with a simple use case to fix a unit test, and it keeps giving incorrect suggestions. Here, the suggestion is incorrect, breaks the working code and does not update the test. Maybe it's a user error. I will give it some more time.



GZ7NLTFWMAA5GTD.jpg


4/9
@itamar_mar
I've recorded 5min explanation:
x.com

Great summary @omarsar0 , thank you ♥️

[Quoted tweet]
Introducing AlphaCodium-o1 ✨
Achieves state-of-the-art results on the Codeforces Code Contest benchmark, outperforming o1-ioi !

Recently, @gdb suggested that o1 unlocks System II thinking.
However, I disagree and propose that o1 operates at a System 1.5 level of thinking

1/


5/9
@TaNGSoFT
As I said after watching Chollet's talk, which cleared up my doubts about the boundary of current LLM intelligence:

The attention mechanism realized by the Transformer architecture is the digital neural network's process over text. The abstraction he describes here is similar to the "compression" Ilya talks about. This process is value-centric. Continuous, value-centric abstraction is suited to intention understanding, knowledge retrieval, and intuitive persuasion. It is especially suitable for conversational applications, so the abstraction of next-word prediction is fine;

but reasoning needs the program of Type 2 thinking. Program-centric abstraction is the abstraction of discrete thinking. It seems another layer of logical structure needs to be built on top of the text. This structure is very deterministic, which is why the word "program" is used. The attention of the Transformer architecture alone is not enough; a different breakthrough is needed.

And here, AlphaCodium did something like that program structure.



6/9
@feulf
not a fan of AlphaCodium, I've tried it before; it's pretty much a copy of Cursor, but it asks for access to your full codebase, whereas (according to the founders) Cursor only accesses your codebase locally and pushes the embeddings + a Merkle tree of the changes to cache it.



7/9
@TheLobbyistGuy
It adds in a planning phase or two before generating the code.

Anyone who has used LLMs to do much coding knows that this improves performance a great deal.



8/9
@bate5a55
Interesting to see AlphaCodium's "Problem Reflection" phase—an uncommon feature in AI models. By integrating self-reflection, it's mimicking expert programmers' introspection, which might explain its edge over o1's CoT.



9/9
@gpt_biz
This is exciting! AlphaCodium sounds like a game changer in improving o1's reasoning and performance




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196











1/12
@itamar_mar
Introducing AlphaCodium-o1 ✨
Achieves state-of-the-art results on the Codeforces Code Contest benchmark, outperforming o1-ioi !

Recently, @gdb suggested that o1 unlocks System II thinking.
However, I disagree and propose that o1 operates at a System 1.5 level of thinking

1/



https://video.twimg.com/ext_tw_video/1846101570376630272/pu/vid/avc1/1152x720/89VrSUNLE2cOxwYl.mp4

2/12
@itamar_mar
2/
‣System 1 – quick inference
‣System 1.5 – guided chain-of-thought
‣System 2 – deep, deliberate reasoning reinforced by information arriving from validation processes, using & acquiring relevant thinking frameworks and tools, including devising & choosing between options



3/12
@itamar_mar
3/ AlphaCodium is an open-source research tool that boosts the performance of various leading models, including o1-preview

It employs a code-oriented, multi-stage flow that emphasizes continuous improvement through iteration

Designed by the team at @QodoAI



GZ4OwzYWEAAsFmy.jpg

GZ4PClBWgAA7z5c.jpg


4/12
@itamar_mar
4/ If o1 was already exhibiting System 2 thinking, I claim it would have gained less from being wrapped up with AlphaCodium

More in the video above and in the blog: AlphaCodium Outperforms Direct Prompting of o1 Model

By the way - does System II thinking mean AGI?



5/12
@beyondthisjourn
Sam Altman said that o1 is only at GPT-2 level of reasoning. Until we get to GPT-4 level, it's not yet System 2.



6/12
@itamar_mar
Yea, saw that.
Makes sense to me.

Here is another similar quote by him

[Quoted tweet]
here is o1, a series of our most capable and aligned models yet:

openai.com/index/learning-to…

o1 is still flawed, still limited, and it still seems more impressive on first use than it does after you spend more time with it.


GXStyAIW8AEQDka.jpg


7/12
@DavidBensh
We don't need more nomenclature 😅
Neuroscience actually distinguishes System 2 vs System 1 mainly as semantic (and therefore conscious) vs associative (and therefore automatic and unconscious).
I want to see either semantic references or consciousness before this question becomes relevant.



8/12
@itamar_mar
Interesting.

What would a conscious model/AI system look like?



9/12
@SiVola
Congrats! Looks very promising. Can we use it in Cursor?



10/12
@itamar_mar
AlphaCodium is a research tool introduced by @QodoAI.

At Qodo, we develop various tools, including Qodo Gen which is an IDE extension that can be installed in Cursor.
However, AlphaCodium itself isn't directly implemented in Qodo Gen; rather, the learnings are continuously being integrated.

In the future, AlphaCodium will become a flow that will work for real-world software (and then we aim to call it Qodo Flow; the UX/UI is still not disclosed)



11/12
@roybenjos
Amazing! Good luck!



12/12
@itamar_mar
Good luck to... the future of software development. Exciting times ahead of us

Right now AI-empowered tools like AlphaCodium can do pretty well (93rd+ percentile) on code contests. In the future, this will translate to agents that can handle end-to-end sub-tasks on real world software




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196





1/11
@svpino
Another step closer to having AI write code better than humans!

The new release of AlphaCodium, an open-source state-of-the-art code generation tool, outperforms directly prompting OpenAI when generating code.

This is a huge deal. The research team @QodoAI tested this on the Codeforces Code Contest benchmark, and the leap is huge:

Using o1-preview

• Direct prompting: 55%
• AlphaCodium: 78%

Using o1-mini

• Direct prompting: 53%
• AlphaCodium: 74%

These results make AlphaCodium the best approach to generate code we've seen so far.

I'm linking to a blog post with more information, the paper, and the GitHub repository below, but here is a 30-second summary of how AlphaCodium works:

AlphaCodium relies on an iterative process that repeatedly runs and fixes the generated code using the testing data (a rough pseudocode sketch of this loop follows the numbered list below).

1. The first step is to have the model reason about the problem. They describe it using bullet points and focus on the goal, inputs, outputs, rules, constraints, and any other relevant details.

2. Then, they make the model reason about the public tests and come up with an explanation of why the input leads to that particular output.

3. The model generates two to three potential solutions in text and ranks them in terms of correctness, simplicity, and robustness.

4. Then, it generates more diverse tests for the problem, covering cases not part of the original public tests.

5. Iteratively, pick a solution, generate the code, and run it on a few test cases. If the tests fail, improve the code and repeat the process until the code passes every test.
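
A rough pseudocode sketch of that loop, with hypothetical `llm` and `run_tests` callables and made-up prompts; this is not the actual AlphaCodium implementation:

```python
def alphacodium_style_flow(problem, public_tests, llm, run_tests, max_fixes=5):
    """Hedged sketch of the five-step flow summarized above."""
    # 1. Reason about the problem (goal, inputs, outputs, rules, constraints).
    reflection = llm(f"Summarize in bullets the goal, inputs, outputs, rules, constraints:\n{problem}")
    # 2. Reason about the public tests: why does each input produce its output?
    test_reasoning = llm(f"Explain why each input leads to its output:\n{public_tests}")
    # 3. Generate a few candidate solutions in text and rank them.
    plans = [llm(f"Propose a solution plan:\n{problem}\n{reflection}") for _ in range(3)]
    best_plan = llm(f"Pick the plan that is most correct, simple, and robust:\n{plans}")
    # 4. Generate extra tests covering cases beyond the public ones.
    ai_tests = llm(f"Write additional edge-case tests:\n{problem}\n{test_reasoning}")
    # 5. Code, run against the tests, and fix iteratively until everything passes.
    code = llm(f"Implement this plan as code:\n{best_plan}")
    for _ in range(max_fixes):
        failures = run_tests(code, public_tests + "\n" + ai_tests)
        if not failures:
            break
        code = llm(f"Fix the code so these failing tests pass:\n{code}\n{failures}")
    return code
```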

There's a lot more information in the paper and the blog post. Here are the links:

• Blog: AlphaCodium Outperforms Direct Prompting of o1 Model
• Paper: [2401.08500] Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering
• Code: GitHub - Codium-ai/AlphaCodium: Official implementation for the paper: "Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering""

I attached an image comparing AlphaCodium with direct prompting using different models.



GZ8GWFmbwAAQAuH.jpg


2/11
@svpino
Take a look at this video for a bit more information about this:

[Quoted tweet]
Introducing AlphaCodium-o1 ✨
Achieves state-of-the-art results on the Codeforces Code Contest benchmark, outperforming o1-ioi !

Recently, @gdb suggested that o1 unlocks System II thinking.
However, I disagree and propose that o1 operates at a System 1.5 level of thinking

1/


https://video.twimg.com/ext_tw_video/1846101570376630272/pu/vid/avc1/1152x720/89VrSUNLE2cOxwYl.mp4

3/11
@gpt_biz
This is exciting news about AlphaCodium's impressive advancements in code generation—definitely worth exploring further for anyone interested in AI and programming!



4/11
@MarkusOdenthal
Nice, I know what I will do today.

But seems like another tool. So again switching? 🤔



5/11
@TonyOConnell
I don’t think I’m too special but about 85 percent of my inference results in the code I want. I have good prompts that make AI code much better than this human anyway.



6/11
@AlirezaShabaniV
It is great to see that this is open source. Thanks for sharing this.



7/11
@simform
For tech startups and enterprises looking to scale their development process, tools like AlphaCodium should be at the top of your list.

The iterative testing and fixing process ensures that your generated code is not just functional but also resilient. This could reduce your debugging time drastically while improving code performance. Definitely worth integrating into CI/CD pipelines.



8/11
@Sai_2_7_9
We have trained them, and we should not forget that. Yeah, due to multiple rounds of training, and because it doesn't have any emotion or fear, it may write better. But we are the owners of AI



9/11
@tariusdamon
anyone got a graph showing coding capabilities over time? I’m evaluating whether to defer a big refactor for the inevitable.

such a weird time to know how to code



10/11
@digitalhealthxx
Impressive progress with AlphaCodium! The iterative approach to code generation, combined with reasoning about problem constraints and generating diverse tests, seems like a significant step forward. Achieving such a leap in accuracy on the Codeforces benchmark demonstrates the potential for automated code generation tools to truly enhance developer productivity. Exciting to see open-source solutions like this pushing the boundaries of what's possible in AI-driven coding!



11/11
@LeeLeepenkman
Is this going to be the same for people writing code?

We should probably all have more tests :D



GZ_OZjrb0AAJ2_H.png



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,112
Reputation
8,239
Daps
157,800





1/5
@rohanpaul_ai
In-Context Reinforcement Learning (ICRL) unlocks new learning paradigms for LLMs, enabling adaptation through reward signals alone, without parameter updates.

The paper proposes an algorithm that increases test-time compute, as well as a compute-bound approximation of it.

**Original Problem** 🤔:

LLMs exhibit in-context supervised learning, but can they perform In-Context Reinforcement Learning (ICRL) without parameter updates?

-----

**Solution in this Paper** 🧠:

• Proposed Explorative ICRL algorithm to address exploration deficiency
• Introduced stochasticity in prompt construction by randomly sampling past episodes
• Filtered out negative reward examples to simplify prompt reasoning
• Developed Approximate ICRL to reduce computational costs while maintaining performance

-----

**Key Insights from this Paper** 💡:

• Naive ICRL fails due to lack of exploration and difficulty learning from negative rewards
• LLMs can effectively learn from rewards alone through ICRL
• Stochasticity in context generation and focusing on positive examples are crucial for ICRL success
• Approximate ICRL offers a compute-efficient alternative to Explorative ICRL

-----

**Results** 📊:

• Explorative ICRL significantly outperformed zero-shot and naive ICRL across all tasks
• Banking-77 task: Llama improved from 17.2% zero-shot to 66.0% accuracy with Explorative ICRL
• Approximate ICRL reduced processed tokens by two orders of magnitude compared to Explorative
• Llama showed more robustness to approximation than Phi, requiring less computational budget



GaBYwFIXoAAZtrt.png


2/5
@rohanpaul_ai
🧠 How does in-context reinforcement learning (ICRL) differ from standard in-context learning (ICL)?

ICRL uses triplets of input, model prediction, and reward in the context, instead of input-output pairs used in ICL. The model has to learn from these reward signals rather than gold labels.



GaBZfrlWgAE__Xy.png


3/5
@rohanpaul_ai
Explorative ICRL was computationally expensive due to constructing a new context for each input. The authors proposed an Approximate ICRL method that maintains a fixed number of contexts and gradually expands them, reducing computational costs while maintaining performance.



4/5
@rohanpaul_ai
📚 [2410.05362] LLMs Are In-Context Reinforcement Learners



5/5
@rohanpaul_ai
🚫 The naive ICRL approach failed miserably, with models quickly degenerating to always predicting the same output. This was due to an inability to explore and difficulty learning from complex in-context signals like negative rewards.

🔍 So the authors introduce two key modifications to address the exploration problem in ICRL:

The authors introduced stochasticity in prompt construction by randomly sampling past episodes to include in the context. They also filtered out negative reward examples, focusing only on positive rewards in the context.
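
A minimal sketch of that explorative context construction, assuming hypothetical `llm` and `classify_reward` callables (not the paper's implementation):

```python
import random

def build_icrl_context(episodes, keep_prob=0.5):
    """Randomly subsample past episodes and keep only positive-reward ones."""
    lines = []
    for ep in episodes:
        if ep["reward"] > 0 and random.random() < keep_prob:
            lines.append(f"Input: {ep['input']}\nPrediction: {ep['prediction']}\nReward: 1")
    return "\n\n".join(lines)

def explorative_icrl_step(llm, classify_reward, episodes, new_input):
    # Stochastic contexts mean different prompts per input, which drives exploration.
    context = build_icrl_context(episodes)
    prediction = llm(f"{context}\n\nInput: {new_input}\nPrediction:")
    reward = classify_reward(new_input, prediction)  # e.g. 1 if the label is correct, else 0
    episodes.append({"input": new_input, "prediction": prediction, "reward": reward})
    return prediction, reward
```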



GaBZ6HPWQAAT4qF.png



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,112
Reputation
8,239
Daps
157,800





1/3
@rohanpaul_ai
Standard Vision Transformer FAILS on Abstraction and Reasoning Corpus (ARC) tasks.

So this paper proposes ViTARC, a Vision Transformer (ViT) architecture for Abstraction and Reasoning Corpus (ARC) tasks and visual reasoning

📊 Results:

~100% test solve rate on >50% of 400 public ARC tasks
Achieved via supervised learning from input-output grids

[Quoted tweet]
We trained a Vision Transformer to solve ONE single task from @fchollet and @mikeknoop’s @arcprize. Unexpectedly, it failed to produce the test output, even when using 1 MILLION examples! Why is this the case? 🤔


GaBbpZ4XwAAmYwz.jpg

GZ8UkeMW0AAIedx.jpg


2/3
@rohanpaul_ai
💡 Insights:

- Highlights ViT's representational deficiency for structured mappings

- Stresses importance of inductive biases for abstract visual reasoning

- Demonstrates need for task-specific architectural modifications



3/3
@HCSolakoglu
"~100% test solve rate on >50% of 400 public ARC tasks" holy...




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196











1/11
@WenhaoLi29
We trained a Vision Transformer to solve ONE single task from @fchollet and @mikeknoop’s @arcprize. Unexpectedly, it failed to produce the test output, even when using 1 MILLION examples! Why is this the case? 🤔



GZ8UkeMW0AAIedx.jpg


2/11
@WenhaoLi29
We investigated and found that there exist fundamental limitations in the vanilla Vision Transformer preventing it from performing visual abstract reasoning. We propose enhancements to address these shortcomings in our new paper “Tackling the Abstraction and Reasoning Corpus with Vision Transformers: the Importance of 2D Representation, Positions, and Objects” ([2410.06405] Tackling the Abstraction and Reasoning Corpus with Vision Transformers: the Importance of 2D Representation, Positions, and Objects)



GZ8Up4YWsAAx7sP.jpg


3/11
@WenhaoLi29
Specifically, we found that: 1) ViT has limited spatial awareness due to the representation of the images using flattened image patches. We address these issues by introducing special 2D visual Tokens that enable the ViT to become spatially aware. 2) ViT has limited access to positional and object information. We tackle this using object-based positional encodings that allow the ViT to attend to the correct pixels within the image grid.
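
For intuition only, here is a toy sketch of what 2D (row/column) positional information for grid cells could look like; the constants and random embeddings are made up and this is not the ViTARC code:

```python
import numpy as np

def add_2d_positions(grid_tokens, d_model=64):
    """Encode each grid cell with separate row and column embeddings instead of a
    single flattened 1D position. grid_tokens: (H, W, d_model) array of cell embeddings."""
    H, W, _ = grid_tokens.shape
    rng = np.random.default_rng(0)
    row_emb = rng.normal(size=(H, d_model // 2))   # would be learned in practice
    col_emb = rng.normal(size=(W, d_model // 2))
    pos = np.concatenate(
        [np.repeat(row_emb[:, None, :], W, axis=1),
         np.repeat(col_emb[None, :, :], H, axis=0)],
        axis=-1,
    )  # (H, W, d_model)
    return grid_tokens + pos
```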



4/11
@WenhaoLi29
Implementing our enhancements, our framework “ViTARC” saw a significant improvement from the vanilla ViT! Task-specific models were able to achieve 100% accuracy on over half of the 400 ARC training tasks.



GZ8U1vHXwAAjaFW.jpg


5/11
@WenhaoLi29
Learn more about our work here: [2410.06405] Tackling the Abstraction and Reasoning Corpus with Vision Transformers: the Importance of 2D Representation, Positions, and Objects
w/ @yudongxuwil @ScottSanner @lyeskhalil



6/11
@tinycrops
what happens if you use illumination to encode spatial relationship information in the image? @_jason_today Building Real-Time Global Illumination: Radiance Cascades



GZ9LWtmXsAAc0g4.png


7/11
@WenhaoLi29
Looks great! And yes, for OPE in our paper, you can use any external source of objectness information.



8/11
@JonathanRoseD
I've been playing around with just this idea to better read ARC patterns, but you're miles ahead here. Will/when this model be open sourced for us to play with? 🤠



9/11
@WenhaoLi29
Yes, we're working on it! The enhancements we mentioned are not too hard to implement on a raw CodeT5 or T5, so you could give it a try directly in the meantime.



10/11
@rkarmani
Did you submit it for evaluation over the private ARC tasks?



11/11
@WenhaoLi29
No, this isn’t an ARC solver yet (still working on generalization), but a solver still needs to read grids, so the enhancements are definitely relevant.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,112
Reputation
8,239
Daps
157,800


1/11
@gordic_aleksa
How does OpenAI's o1 exactly work? Here is a list of papers & summaries on LLM reasoning that I've recently read.

I'll split them into 2 categories:

1) prompt-based - enforce step by step reasoning & self-correcting flow purely using prompts

2) learning-based - bake in the above into the policy model's weights (or into a verifier - usually a PRM; process reward model)

--- Prompt-based

0) CoT et al: https://arxiv.org/pdf/2201.11903

Ground zero (plus I'm a C guy :P iykyk). It started with the Chain-of-Thought paper. This class of methods boils down to asking the LLM nicely to reveal its internal thoughts (e.g. "Let's think step by step"; more generally, telling the model to disclose the intermediate computation steps in some way).

A simple variation on CoT would be "CoT self-consistency" -> i.e. sample multiple CoT traces in parallel and use majority voting to find the "right" answer.
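
A minimal sketch of CoT self-consistency, assuming a hypothetical `llm` callable whose output ends with a line like "Answer: <value>":

```python
from collections import Counter

def self_consistency(llm, question, n_samples=10):
    """Sample several CoT traces and majority-vote over the extracted final answers."""
    answers = []
    for _ in range(n_samples):
        trace = llm(f"{question}\nLet's think step by step.")
        # Take the last 'Answer:' line as the extracted answer.
        for line in reversed(trace.splitlines()):
            if line.startswith("Answer:"):
                answers.append(line.removeprefix("Answer:").strip())
                break
    return Counter(answers).most_common(1)[0][0] if answers else None
```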

1) ToT (tree of thought): https://arxiv.org/pdf/2305.10601

Further complexifies the above (in CS terms we go from linear list -> tree): build up an m-ary tree of intermediate thoughts (thoughts = intermediate steps of CoT); at each thought/node:

a) run the “propose next thoughts” prompt (or just sample completions m times)
b) evaluate those thoughts (either independently or jointly)
c) keep the top m

cons: very expensive & slow
pro: works with off-the-shelf LLMs
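
A rough sketch of that tree search, with hypothetical `llm` and `score` callables (not the paper's implementation):

```python
def tree_of_thought(llm, score, problem, depth=3, breadth=3, keep=2):
    """Expand partial 'thoughts', score them, and keep the best few per level."""
    frontier = [""]  # each entry is a partial chain of thoughts
    for _ in range(depth):
        candidates = []
        for partial in frontier:
            for _ in range(breadth):
                # a) propose a next thought continuing this partial chain
                nxt = llm(f"Problem: {problem}\nThoughts so far: {partial}\nNext thought:")
                candidates.append(partial + "\n" + nxt)
        # b) evaluate candidates, c) keep the top `keep`
        frontier = sorted(candidates, key=lambda c: score(problem, c), reverse=True)[:keep]
    return frontier[0]
```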

2) Self-Reflection: https://arxiv.org/pdf/2405.06682

If the response is incorrect, pass self-reflection feedback back to the LLM before attempting to re-answer;

As input, the self-reflection step gets the gold answer and is prompted to explain how it would now solve the problem. The results are redacted before the feedback is passed back, to avoid leaking the solution.

Even re-answering given only a binary feedback (“your previous answer was incorrect”) is significantly stronger than the baseline (no feedback, just sample a response once).
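
A minimal sketch of that feedback loop, assuming a hypothetical `check` function that compares an answer to the gold answer and returns (ok, feedback):

```python
def answer_with_reflection(llm, check, question, max_attempts=3):
    """Re-answer with feedback; even minimal feedback helps, and richer
    self-reflection feedback is redacted so it doesn't leak the solution."""
    answer = llm(question)
    for _ in range(max_attempts):
        ok, feedback = check(question, answer)
        if ok:
            return answer
        answer = llm(f"{question}\nPrevious answer: {answer}\nFeedback: {feedback}\nTry again:")
    return answer
```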

3) Self-Contrast: https://arxiv.org/pdf/2401.02009

a) create multiple solutions by evaluating diverse prompts derived from the original question (yields diverse perspectives on how to solve the problem)
b) pairwise contrast the solutions
c) generate a todo checklist in order to revise the generations from a)

4) Think Before You Speak: https://arxiv.org/pdf/2311.07445

Introduces the CSIM method: 5 prompts that make the model better at dialogue applications.

The 5 prompts they use to help improve communication skills are "empathy", "topic transition", "proactively asking questions", "concept guidance", and "summarizing often".

Their LLM has 2 roles: thinking and speaking. The thinking role, or "inner monologue", is occasionally triggered by the 5 prompts and is not displayed to the user but is instead used as input for the user-facing speaking role.

I think these 5 bullet points nicely capture the main modes of prompt-based methods I've observed -> lmk if I missed an important one!

--- learning-based -> coming up in the follow-up post



GZ7DjBAWQAASISN.jpg


2/11
@gordic_aleksa
One more here (I guess I should have labeled this bucket "not learning-based" rather than "prompt-based"... anyhow :smile:):

Chain-of-Thought Reasoning w/o prompting: https://arxiv.org/pdf/2402.10200

Introduces the CoT-decoding method, showing that pre-trained LMs can reason w/o explicit CoT prompting.

The idea is simple: sample top-k paths and compute the average difference between the probability of the 1st and 2nd most likely token along the answer’s token sequence. Take the path that maximizes this metric - that is likely the CoT path.

Has a good argument against the CoT self-consistency method: most of the solutions it samples are incorrect; only by using their heuristic do they pick out the implicit CoT traces.
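
A tiny sketch of that selection rule, assuming you already have the per-token (top-1, top-2) probabilities over each sampled path's answer tokens:

```python
def cot_decoding_pick(paths):
    """Pick the path with the largest average gap between the most and
    second-most likely token (the paper's confidence metric, roughly)."""
    def confidence(path):
        return sum(p1 - p2 for p1, p2 in path) / len(path)
    return max(range(len(paths)), key=lambda i: confidence(paths[i]))

# Example: the second path has the larger average margin, so it is chosen.
paths = [[(0.40, 0.35), (0.50, 0.45)], [(0.90, 0.05), (0.80, 0.10)]]
assert cot_decoding_pick(paths) == 1
```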



3/11
@Ki_Seki_here
There are more options here.

[Quoted tweet]
Here, I'd like to share one line: reasoning topologically. @OpenAI o1 uses one such technique, Quiet-STaR. Beyond this, there are other related techniques: optimizing the input , introducing more topological relationships, applying inference in decoding phase...

10/11


GXra6i_aIAAa5ko.jpg


4/11
@bate5a55
Fascinating breakdown! Interestingly, OpenAI's o1 reportedly employs a dynamic Tree of Thoughts during inference, allowing it to adaptively optimize reasoning paths in real-time—almost like crafting its own cognitive map on the fly.



5/11
@sabeerawa05
Superb!



6/11
@CoolMFcat
Youtube video when



7/11
@Mikuk84
Thanks so much for the paper list!



8/11
@marinac_dev
What about the Swarm they released recently?
IMO, o1 is just the swarm principle on steroids.



9/11
@attention_gan
Does that mean it is not actually thinking but trying to use a combination of steps seen in the data



10/11
@joao_monoirelia
.



11/11
@MarkoVelich
Nice overview. Recently I did a talk on this topic and put a bunch of references throughout: Intuition Behind Reasoning in LLMs
My view on all these post-training/prompting techniques is that they are basically a filtering of the pre-training corpus: ways to squeeze out good traces from the model.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196





1/9
@gordic_aleksa
How does OpenAI's o1 exactly work? Part 2. Here is a list of papers & summaries on LLM reasoning that I've recently read. All learning-based.

0) STaR: Self-Taught Reasoner https://arxiv.org/pdf/2203.14465

Ground Zero. Instead of always having to CoT prompt, bake that into the default model behavior. Given a dataset of (question, answer) pairs, manually curate a few CoT traces and use them as fewshot examples to generate (rationale, answer) tuples, given a question, for the rest of the dataset ("bootstrap reasoning"). SFT on (question, rationale, answer) triples. Do multiple iterations of this, i.e., collect data, retrain, and keep sampling from the new policy (in practice I've observed people use ~3 iterations). Con: can only learn from correct triples.
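
A rough sketch of one such iteration, with hypothetical `llm` and `check_answer` stand-ins (the actual fine-tuning step is omitted):

```python
def star_iteration(llm, dataset, fewshot_rationales, check_answer):
    """Bootstrap rationales with few-shot CoT exemplars and keep only the
    (question, rationale, answer) triples whose answer matches the gold answer."""
    sft_examples = []
    for question, gold in dataset:
        out = llm(f"{fewshot_rationales}\nQ: {question}\nRationale and answer:")
        if "Answer:" in out:
            rationale, answer = out.rsplit("Answer:", maxsplit=1)
        else:
            rationale, answer = out, ""
        if check_answer(answer.strip(), gold):
            sft_examples.append((question, rationale.strip(), gold))
    return sft_examples  # fine-tune on these, then repeat with the new policy
```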

1) V-STaR: https://arxiv.org/pdf/2402.06457

A variation on STaR. Collect both correct/incorrect samples. Incorrect ones can be used to train a verifier LLM via DPO. During inference additionally use the verifier to rank the generations.

---
[Recommended reading for 2) below]
Let’s Verify Step by Step: https://arxiv.org/pdf/2305.20050

They train a Process Supervision Reward Model PRM and show its advantage over Outcome Supervision RM - ORM (i.e. don't just pass a whole generation and ask for how good it is, pass in individual CoT elements instead; finer resolution).

Data collection: Humans were employed to annotate each line of step-by-step solutions (math problems) - expensive!! They've been given a curated list of samples that were automatically selected as “convincing wrong-answer” solutions - a form of hard sample mining.
---

2) Improve Mathematical Reasoning in LMs: https://arxiv.org/pdf/2406.06592

Proposes OmegaPRM - a process RM trained on data collected using MCTS (AlphaGo vibes? yes, paper comes from DeepMind). As a policy for collecting PRM data they used a SFT-ed Gemini Pro (instruct samples distilled from Gemini Ultra). Gemini Pro combined with OmegaPRM-weighted majority voting yielded nice improvements on math benchmarks. Suitable only when there is a golden answer, doesn’t work for open-ended tasks.

3) Learn Beyond The Answer: https://arxiv.org/pdf/2406.12050

Instead of increasing the number of samples in the SFT dataset (STaR-like), they augment/extend the samples by appending a reflection to the existing (q,a) tuple (reflection = alternative reasoning and follow-ups like analogy & abstraction). Complementary to STaR.

4) Quiet-STaR: https://arxiv.org/pdf/2403.09629

Uses RL (the REINFORCE method) and picks rationales, which branch off of a thought, that increase the likelihood of the correct answer, and subsequently trains on them. Adds new start/end-of-thinking tokens, and an MLP for mixing the rationale stream with the default stream. Interesting ideas; I feel that due to its complexity it won't survive the Bitter Lesson's filter.

5) Scaling LLM Test-Time Compute: https://arxiv.org/pdf/2408.03314

They first estimate the difficulty of a user query, placing it into one of 5 difficulty bins (I think they need 2048 samples for this!). Then, depending on the query’s difficulty, they deploy various techniques to estimate the optimal results.

They experimented with 2 categories of methods: search (requires a PRM verifier) & revisions (afaik they only did intra-comparisons, i.e. they didn’t compare search vs revision).

For search they experimented with best-of-N-weighted, beam, and lookahead.

For easier Qs, best-of-N is the best method; later it's beam (lookahead never pays off). For revision, they first do SFT on samples that consist of 0-4 incorrect intermediate steps followed by the correct solution (to teach it to self-correct). Subsequently they test applying it sequentially vs in parallel. There exists an “optimal” ratio of sequential-to-parallel depending on the difficulty bin.
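
A minimal sketch of verifier-weighted best-of-N selection (the scores here are made up; in the paper they come from a PRM):

```python
from collections import defaultdict

def best_of_n_weighted(samples):
    """`samples` is a list of (final_answer, verifier_score) pairs. Instead of
    picking the single highest-scored sample, sum scores per distinct answer."""
    totals = defaultdict(float)
    for answer, score in samples:
        totals[answer] += score
    return max(totals, key=totals.get)

# "42" wins because its total verifier mass is larger than "17"'s single high score.
print(best_of_n_weighted([("42", 0.6), ("42", 0.5), ("17", 0.9)]))  # -> 42
```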

6) Agent Q: https://arxiv.org/pdf/2408.07199

Really the main idea is to use MCTS to do test-time search (according to the perf they get not according to paper authors).

The second core idea is to leverage successful & unsuccessful trajectories collected using a “guided MCTS” (guided by an LLM judge) to improve the base policy via DPO, this gives them the “Agent Q”. Agent Q + test-time MCTS yields the best results.

Note: they operate in a very specific environment - web and focus on a narrow task: booking a table.

7) Training LMs to Self-Correct via RL: https://arxiv.org/pdf/2409.12917

They introduce SCoRe, a multi-turn (multi=2) RL method. They show that STaR methods fail to improve in a multi-turn setup (due to behavior collapse as measured by the edit distance histogram; i.e. STaR models are reluctant to deviate significantly from their 1st solution).

Training proceeds in 2 stages:

1) train the base model to produce high reward responses at the 2nd attempt while forcing the model not to change its 1st attempt (via KL div)

2) jointly maximize reward of both attempts; in order to prevent a collapse to a non-self-correcting behavior they do reward shaping by adding an additional penalty that rewards the model if it achieves higher reward on the 2nd attempt and heavily penalizes the transitions from correct->incorrect.
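
For intuition, an illustrative reward-shaping function in the spirit of stage 2; the constants are made up and this is not the paper's exact formulation:

```python
def shaped_two_attempt_reward(r1, r2, alpha=0.5, flip_penalty=2.0):
    """Reward both attempts, add a bonus proportional to the improvement on the
    second attempt, and heavily penalize correct -> incorrect flips."""
    shaped = r1 + r2 + alpha * (r2 - r1)
    if r1 == 1 and r2 == 0:  # the model "un-corrected" itself
        shaped -= flip_penalty
    return shaped

print(shaped_two_attempt_reward(0, 1))  # self-correction is rewarded
print(shaped_two_attempt_reward(1, 0))  # regression is heavily penalized
```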

---

That's a wrap for now, so how does o1 work?

No idea, but likely involves above ideas + a shyt ton of compute & scale (data, params).

Let me know if I missed an important paper that's not an obvious interpolation of the above ideas (unless it's seminal despite that? :smile:).



GZ7SLALXYAAhpdw.jpg


2/9
@gordic_aleksa
thanks @_philschmid for aggregating the list LLM Reasoning Papers - a philschmid Collection



3/9
@JUQ_AI
Are you impressed with o1?



4/9
@barrettlattimer
Thanks for making this list! Another is
Sparse Rewards Can Self-Train Dialogue Agents: https://arxiv.org/pdf/2409.04617

Introduces JOSH, an algorithm that uses sparse rewards to extract the ideal behavior of a model in multi turn dialogue and improves quality for small & frontier models



5/9
@neurosp1ke
I would be very interested in creating a “must-read-essentials” short-list of the GitHub - open-thought/system-2-research: System 2 Reasoning Link Collection list. I will take your list as a first candidate.



6/9
@axel_pond
top quality content, king



7/9
@synthical_ai
Dark mode for this paper for those who read at night 🌙 STaR: Bootstrapping Reasoning With Reasoning



8/9
@MapaloChansa1
Interesting



9/9
@eatnow240008
[2310.04363] Amortizing intractable inference in large language models
Perhaps this?




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,112
Reputation
8,239
Daps
157,800





1/11
@akshay_pachaar

Microsoft just changed the game! 🔥

They've open-sourced bitnet.cpp: a blazing-fast 1-bit LLM inference framework that runs directly on CPUs.

Why is this a game-changer❓

You can now run 100B parameter models on local devices with up to 6x speed improvements and 82% less energy consumption—all without a GPU!

The future we've been waiting for: fast, efficient, and private AI that works anytime, anywhere.✨

Link to the GitHub repo in next tweet!
_____

Find me → @akshay_pachaar ✔️

For more insights and tutorials on AI and Machine Learning!

https://video.twimg.com/ext_tw_video/1847312397738143745/pu/vid/avc1/1058x720/pqz1GvjKHQYwx2Yw.mp4

2/11
@akshay_pachaar

Here's the official GitHub repo: GitHub - microsoft/BitNet: Official inference framework for 1-bit LLMs
------

Interested in ML/AI Engineering? Sign up for our newsletter for in-depth lessons and get a FREE pdf with 150+ core DS/ML lessons: Daily Dose of Data Science Newsletter

3/11
@MarkPommrehn

Cool! Thanks! That's a wonderful leap in accessibility and cost-effective usability!

4/11
@akshay_pachaar

Absolutely! 💯

5/11
@stephen_rayner

Limitations?

6/11
@akshay_pachaar

The quality needs to improve a lot, but it's a good start

7/11
@bchewyme

Thanks for sharing! 🙌

8/11
@akshay_pachaar

You’re welcome!

9/11
@joaomendoncaaaa

It's literally hallucinating like hell on your example. Look at the stdout, it's all infinitely repeated sentences.

1bit quant is great, but let's breathe for a second lol

10/11
@c___f___b

Nice. With such progress they will move from 1-bit to 1-trit architecture next year to get rid of hallucinations.

11/11
@ATeRSa_NDUcC

Your video shows it's literally braindead. What use is a 100B model that has been quantized to lobotomy levels?

GaNKkwMXEAAkiDk.png


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

Supported Models​

bitnet.cpp supports a list of 1-bit models available on Hugging Face, which were trained in research settings. We hope the release of bitnet.cpp can inspire more 1-bit LLMs trained in large-scale settings.

Model                        Parameters   CPU
bitnet_b1_58-large           0.7B         x86, ARM
bitnet_b1_58-3B              3.3B         x86, ARM
Llama3-8B-1.58-100B-tokens   8.0B         x86, ARM

Supported kernels: I2_S, TL1, TL2


Fine-tuning LLMs to 1.58bit: extreme quantization made easy​


Published September 18, 2024
Update on GitHub

medmekk Mohamed Mekkouri
marcsun13 Marc Sun
lvwerra Leandro von Werra
pcuenq Pedro Cuenca
osanseviero Omar Sanseviero
thomwolf Thomas Wolf

As Large Language Models (LLMs) grow in size and complexity, finding ways to reduce their computational and energy costs has become a critical challenge. One popular solution is quantization, where the precision of parameters is reduced from the standard 16-bit floating-point (FP16) or 32-bit floating-point (FP32) to lower-bit formats like 8-bit or 4-bit. While this approach significantly cuts down on memory usage and speeds up computation, it often comes at the expense of accuracy. Reducing the precision too much can cause models to lose crucial information, resulting in degraded performance.

BitNet is a special transformer architecture that represents each parameter with only three values: (-1, 0, 1), offering an extreme quantization of just 1.58 (log2(3)) bits per parameter. However, it requires training a model from scratch. While the results are impressive, not everybody has the budget to pre-train an LLM. To overcome this limitation, we explored a few tricks that allow fine-tuning an existing model to 1.58 bits! Keep reading to find out how!
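
For intuition, here is a toy sketch of absmean-style ternarization as described in BitNet b1.58 write-ups; this is illustrative only, not the blog's or bitnet.cpp's actual code:

```python
import numpy as np

def ternary_quantize(W, eps=1e-8):
    """Scale weights by their mean absolute value, round, and clip to {-1, 0, 1};
    the scale is kept so the quantized weights can be rescaled at inference time."""
    scale = np.mean(np.abs(W)) + eps
    W_q = np.clip(np.round(W / scale), -1, 1)
    return W_q.astype(np.int8), scale

W = np.random.randn(4, 4).astype(np.float32)
W_q, scale = ternary_quantize(W)
print(W_q)                              # entries are only -1, 0, or 1
print(np.abs(W - W_q * scale).mean())   # quantization error of this toy example
```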
 
Last edited:

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,112
Reputation
8,239
Daps
157,800



Nvidia just dropped a new AI model that crushes OpenAI’s GPT-4—no big launch, just big results​

Michael Nuñez@MichaelFNunez

October 16, 2024 6:45 PM

Credit: VentureBeat made with Midjourney



Nvidia quietly unveiled a new artificial intelligence model on Tuesday that outperforms offerings from industry leaders OpenAI and Anthropic, marking a significant shift in the company’s AI strategy and potentially reshaping the competitive landscape of the field.

The model, named Llama-3.1-Nemotron-70B-Instruct, appeared on the popular AI platform Hugging Face without fanfare, quickly drawing attention for its exceptional performance across multiple benchmark tests.

Nvidia reports that their new offering achieves top scores in key evaluations, including 85.0 on the Arena Hard benchmark, 57.6 on AlpacaEval 2 LC, and 8.98 on the GPT-4-Turbo MT-Bench.

These scores surpass those of highly regarded models like OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet, catapulting Nvidia to the forefront of AI language understanding and generation.


Nvidia’s AI gambit: From GPU powerhouse to language model pioneer​


This release represents a pivotal moment for Nvidia. Known primarily as the dominant force in graphics processing units (GPUs) that power AI systems, the company now demonstrates its capability to develop sophisticated AI software. This move signals a strategic expansion that could alter the dynamics of the AI industry, challenging the traditional dominance of software-focused companies in large language model development.

Nvidia’s approach to creating Llama-3.1-Nemotron-70B-Instruct involved refining Meta’s open-source Llama 3.1 model using advanced training techniques, including Reinforcement Learning from Human Feedback (RLHF). This method allows the AI to learn from human preferences, potentially leading to more natural and contextually appropriate responses.

With its superior performance, the model has the potential to offer businesses a more capable and cost-efficient alternative to some of the most advanced models on the market.

The model’s ability to handle complex queries without additional prompting or specialized tokens is what sets it apart. In a demonstration, it correctly answered the question “How many r’s are in strawberry?” with a detailed and accurate response, showcasing a nuanced understanding of language and an ability to provide clear explanations.

What makes these results particularly significant is the emphasis on “alignment,” a term in AI research that refers to how well a model’s output matches the needs and preferences of its users. For enterprises, this translates into fewer errors, more helpful responses, and ultimately, better customer satisfaction.


How Nvidia’s new model could reshape business and research​


For businesses and organizations exploring AI solutions, Nvidia’s model presents a compelling new option. The company offers free hosted inference through its build.nvidia.com platform, complete with an OpenAI-compatible API interface.

This accessibility makes advanced AI technology more readily available, allowing a broader range of companies to experiment with and implement advanced language models.

The release also highlights a growing shift in the AI landscape toward models that are not only powerful but also customizable. Enterprises today need AI that can be tailored to their specific needs, whether that’s handling customer service inquiries or generating complex reports. Nvidia’s model offers that flexibility, along with top-tier performance, making it a compelling option for businesses across industries.

However, with this power comes responsibility. Like any AI system, Llama-3.1-Nemotron-70B-Instruct is not immune to risks. Nvidia has cautioned that the model has not been tuned for specialized domains like math or legal reasoning, where accuracy is critical. Enterprises will need to ensure they are using the model appropriately and implementing safeguards to prevent errors or misuse.


The AI arms race heats up: Nvidia’s bold move challenges tech giants​


Nvidia’s latest model release signals just how fast the AI landscape is shifting. While the long-term impact of Llama-3.1-Nemotron-70B-Instruct remains uncertain, its release marks a clear inflection point in the competition to build the most advanced AI systems.

By moving from hardware into high-performance AI software, Nvidia is forcing other players to reconsider their strategies and accelerate their own R&D. This comes on the heels of the company’s introduction of the NVLM 1.0 family of multimodal models, including the 72-billion-parameter NVLM-D-72B.

These recent releases, particularly the open-source NVLM project, have shown that Nvidia’s AI ambitions go beyond just competing—they are challenging the dominance of proprietary systems like GPT-4o in areas ranging from image interpretation to solving complex problems.

The rapid succession of these releases underscores Nvidia’s ambitious push into AI software development. By offering both multimodal and text-only models that compete with industry leaders, Nvidia is positioning itself as a comprehensive AI solutions provider, leveraging its hardware expertise to create powerful, accessible software tools.

Nvidia’s strategy seems clear: it’s positioning itself as a full-service AI provider, combining its hardware expertise with accessible, high-performance software. This move could reshape the industry, pushing rivals to innovate faster and potentially sparking more open-source collaboration across the field.

As developers test Llama-3.1-Nemotron-70B-Instruct, we’re likely to see new applications emerge across sectors like healthcare, finance, education, and beyond. Its success will ultimately depend on whether it can turn impressive benchmark scores into real-world solutions.

In the coming months, the AI community will closely watch how Llama-3.1-Nemotron-70B-Instruct performs in real-world applications beyond benchmark tests. Its ability to translate high scores into practical, valuable solutions will ultimately determine its long-term impact on the industry and society at large.

Nvidia’s deeper dive into AI model development has intensified the competition. If this is the beginning of a new era in artificial intelligence, it’s one where fully integrated solutions may set the pace for future breakthroughs.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,112
Reputation
8,239
Daps
157,800


Archetype AI’s Newton model learns physics from raw data—without any help from humans​

Michael Nuñez@MichaelFNunez

October 17, 2024 9:00 AM

nuneybits_Vector_art_of_Newtons_famous_prism_experiment_b99ac895-a0d3-429c-910d-6f1aeff49b97.webp




Researchers at Archetype AI have developed a foundational AI model capable of learning complex physics principles directly from sensor data, without any pre-programmed knowledge. This breakthrough could significantly change how we understand and interact with the physical world.

The model, named Newton, demonstrates an unprecedented ability to generalize across diverse physical phenomena, from mechanical oscillations to thermodynamics, using only raw sensor measurements as input. This achievement, detailed in a paper released today, represents a major advance in artificial intelligence’s capacity to interpret and predict real-world physical processes.

“We’re asking if AI can discover the laws of physics on its own, the same way humans did through careful observation and measurement,” said Ivan Poupyrev, co-founder of Archetype AI, in an exclusive interview with VentureBeat. “Can we build a single AI model that generalizes across diverse physical phenomena, domains, applications, and sensing apparatuses?”


From pendulums to power grids: AI’s uncanny predictive powers​


Trained on over half a billion data points from diverse sensor measurements, Newton has shown remarkable versatility. In one striking demonstration, it accurately predicted the chaotic motion of a pendulum in real-time, despite never being trained on pendulum dynamics.

The model’s capabilities extend to complex real-world scenarios as well. Newton outperformed specialized AI systems in forecasting citywide power consumption patterns and predicting temperature fluctuations in power grid transformers.

“What’s remarkable is that Newton had not been specifically trained to understand these experiments — it was encountering them for the first time and was still able to predict outcomes even for chaotic and complex behaviors,” Poupyrev told VentureBeat.

Screenshot-2024-10-17-at-1.01.47%E2%80%AFAM.png
Performance comparison of Archetype AI’s ‘Newton’ model across various complex physical processes. The graph shows that the model, even without specific training (zero-shot), often outperforms or matches models trained specifically for each task, highlighting its potential for broad applicability. (Credit: Archetype AI)


Adapting AI for industrial applications​


Newton’s ability to generalize to entirely new domains could significantly change how AI is deployed in industrial and scientific applications. Rather than requiring custom models and extensive datasets for each new use case, a single pre-trained foundation model like Newton might be adapted to diverse sensing tasks with minimal additional training.

This approach represents a significant shift in how AI can be applied to physical systems. Currently, most industrial AI applications require extensive custom development and data collection for each specific use case. This process is time-consuming, expensive, and often results in models that are narrowly focused and unable to adapt to changing conditions.

Newton’s approach, by contrast, offers the potential for more flexible and adaptable AI systems. By learning general principles of physics from a wide range of sensor data, the model can potentially be applied to new situations with minimal additional training. This could dramatically reduce the time and cost of deploying AI in industrial settings, while also improving the ability of these systems to handle unexpected situations or changing conditions.

Moreover, this approach could be particularly valuable in situations where data is scarce or difficult to collect. Many industrial processes involve rare events or unique conditions that are challenging to model with traditional AI approaches. A system like Newton, which can generalize from a broad base of physical knowledge, might be able to make accurate predictions even in these challenging scenarios.


Expanding human perception: AI as a new sense​


The implications of Newton extend beyond industrial applications. By learning to interpret unfamiliar sensor data, AI systems like Newton could expand human perceptual capabilities in new ways.

“We have sensors now that can detect aspects of the world humans can’t naturally perceive,” Poupyrev told VentureBeat. “Now we can start seeing the world through sensory modalities which humans don’t have. We can enhance our perception in unprecedented ways.”

This capability could have profound implications across a range of fields. In medicine, for example, AI models could help interpret complex diagnostic data, potentially identifying patterns or anomalies that human doctors might miss. In environmental science, these models could help analyze vast amounts of sensor data to better understand and predict climate patterns or ecological changes.

The technology also raises intriguing possibilities for human-computer interaction. As AI systems become better at interpreting diverse types of sensor data, we might see new interfaces that allow humans to “sense” aspects of the world that were previously imperceptible. This could lead to new tools for everything from scientific research to artistic expression.

Archetype AI, a Palo Alto-based startup founded by former Google researchers, has raised $13 million in venture funding to date. The company is in discussions with potential customers about real-world deployments, focusing on areas such as predictive maintenance for industrial equipment, energy demand forecasting, and traffic management systems.

The approach also shows promise for accelerating scientific research by uncovering hidden patterns in experimental data. “Can we discover new physical laws?” Poupyrev mused. “It’s an exciting possibility.”

“Our main goal at Archetype AI is to make sense of the physical world,” Poupyrev told VentureBeat. “To figure out what the physical world means.”

As AI systems become increasingly adept at interpreting the patterns underlying physical reality, that goal may be within reach. The research opens new possibilities – from more efficient industrial processes to scientific breakthroughs and novel human-computer interfaces that expand our understanding of the physical world.

For now, Newton remains a research prototype. But if Archetype AI can successfully bring the technology to market, it could usher in a new era of AI-powered insight into the physical world around us.

The challenge now will be to move from promising research results to practical, reliable systems that can be deployed in real-world settings. This will require not only further technical development, but also careful consideration of issues like data privacy, system reliability, and the ethical implications of AI systems that can interpret and predict physical phenomena in ways that might surpass human capabilities.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,112
Reputation
8,239
Daps
157,800


OpenAI just launched ChatGPT for Windows—and it’s coming for your office software​

Michael Nuñez@MichaelFNunez

October 17, 2024 11:14 AM

Credit: VentureBeat made with Midjourney



OpenAI, the artificial intelligence powerhouse behind ChatGPT, has taken another step in its quest for ubiquity by releasing a Windows desktop application for its popular AI chatbot. The move, announced Thursday, follows the earlier launch of a macOS client and marks a significant push by OpenAI to embed its technology more deeply into users’ daily workflows.

The new Windows app, currently available in preview to ChatGPT Plus, Enterprise, Team, and Edu subscribers, allows users to access the AI assistant via a keyboard shortcut (Alt + Space) from anywhere on their PC. This seamless integration aims to boost productivity by making AI assistance readily available without the need to switch to a web browser.

GaG1fg3acAA3Y3x.jpg
OpenAI’s new ChatGPT desktop application for Windows, showing a user interface with conversation history. (Credit: OpenAI)


OpenAI’s desktop strategy: More than just convenience​


OpenAI’s strategy of platform expansion goes beyond mere convenience. By creating native applications for major operating systems, the company is positioning ChatGPT as an indispensable tool in both personal and professional environments. This move serves multiple purposes: it increases user engagement, facilitates more extensive data collection for model improvement, and creates a sticky ecosystem that could be challenging for competitors to displace.

The desktop app approach also reveals OpenAI’s ambition to become the de facto AI assistant for knowledge workers. By integrating ChatGPT more deeply into users’ workflows, OpenAI is not just improving accessibility but potentially reshaping how people interact with computers and process information.


Enterprise ambitions: ChatGPT as the new office suite?​


The Windows release comes at a critical juncture for OpenAI, as the company faces increasing competition in the AI space and scrutiny over its rapid growth and influential position. Recent reports suggest that OpenAI is exploring partnerships beyond its well-known Microsoft alliance, including discussions with Oracle for AI data center infrastructure and pitches to the U.S. military and national security establishment.

OpenAI’s aggressive expansion into desktop environments signals a potential shift in the enterprise software landscape. The company appears to be positioning ChatGPT as a fundamental productivity tool for businesses, potentially disrupting traditional enterprise software providers. This move, coupled with the recent partnership expansion with Bain & Company to sell ChatGPT to businesses, suggests OpenAI is not content with being merely an AI research lab but is actively pursuing a dominant position in the commercial AI sector.

The implications of this strategy are huge. If successful, ChatGPT could become the new “operating system” for knowledge work, fundamentally changing how businesses operate and potentially displacing or absorbing functions currently served by separate software suites.


Balancing Act: Innovation, ethics, and commercialization​


However, OpenAI’s rapid growth and increasing influence have not been without controversy. The company’s AI models have faced scrutiny over potential biases and the societal implications of widespread AI deployment. Additionally, OpenAI’s dual status as a capped-profit company with significant commercial interests has raised questions about its governance and long-term objectives.

As OpenAI continues to expand its reach, the company faces a delicate balancing act. It must navigate the tensions between its stated mission of ensuring artificial general intelligence benefits humanity and its increasingly commercial focus. The Windows app release, while a seemingly straightforward product expansion, represents another step in OpenAI’s complex journey of shaping the future of AI in both consumer and enterprise contexts.

The success of this desktop strategy could cement OpenAI’s position as the leading AI company, but it also increases the urgency of addressing ethical concerns and potential monopolistic practices. As ChatGPT becomes more deeply integrated into daily work and life, the stakes for getting AI right — in terms of safety, fairness, and societal impact — have never been higher.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,112
Reputation
8,239
Daps
157,800


1/3
@ai_for_success
Newton AI : Researchers at Archetype AI have developed a foundational AI model ( Newton AI) capable of learning complex physics principles directly from sensor data, without any pre-programmed knowledge.

It can accurately predict behaviors of systems it wasn't explicitly trained on, such as pendulum motion.



GaQS7eKXQAAvmAS.jpg


2/3
@ai_for_success
Newton AI @PhysicalAI



https://video.twimg.com/amplify_video/1847003664487202816/vid/avc1/1280x720/WAPRMQ_3-lzdsJqg.mp4

3/3
@ImMr_Wise
How do we replicate the dataset is all I'm thinking 🤔




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,112
Reputation
8,239
Daps
157,800


Microsoft’s Differential Transformer cancels attention noise in LLMs​

Ben Dickson@BenDee983

October 16, 2024 1:16 PM


Robot signal to noise


Image credit: VentureBeat with DALL-E 3


Improving the capabilities of large language models (LLMs) in retrieving in-prompt information remains an area of active research that can impact important applications such as retrieval-augmented generation (RAG) and in-context learning (ICL).

Microsoft Research and Tsinghua University researchers have introduced Differential Transformer (Diff Transformer), a new LLM architecture that improves performance by amplifying attention to relevant context while filtering out noise. Their findings, published in a research paper, show that Diff Transformer outperforms the classic Transformer architecture in various settings.


Transformers and the “lost-in-the-middle” phenomenon​


The Transformer architecture is the foundation of most modern LLMs. It uses an attention mechanism to weigh the importance of different parts of the input sequence when generating output. The attention mechanism employs the softmax function, which normalizes a vector of values into a probability distribution. In Transformers, the softmax function assigns attention scores to different tokens in the input sequence.

However, studies have shown that Transformers struggle to retrieve key information from long contexts.

“We began by investigating the so-called ‘lost-in-the-middle’ phenomenon,” Furu Wei, Partner Research Manager at Microsoft Research, told VentureBeat, referring to previous research findings that showed that LLMs “do not robustly make use of information in long input contexts” and that “performance significantly degrades when models must access relevant information in the middle of long contexts.”

Wei and his colleagues also observed that some LLM hallucinations, where the model produces incorrect outputs despite having relevant context information, correlate with spurious attention patterns.

“For example, large language models are easily distracted by context,” Wei said. “We analyzed the attention patterns and found that the Transformer attention tends to over-attend irrelevant context because of the softmax bottleneck.”

The softmax function used in Transformer’s attention mechanism tends to distribute attention scores across all tokens, even those that are not relevant to the task. This can cause the model to lose focus on the most important parts of the input, especially in long contexts.

“Previous studies indicate that the softmax attention has a bias to learn low-frequency signals because the softmax attention scores are restricted to positive values and have to be summed to 1,” Wei said. “The theoretical bottleneck renders [it] such that the classic Transformer cannot learn sparse attention distributions. In other words, the attention scores tend to flatten rather than focusing on relevant context.”
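
To see why this flattening happens, here is a toy numerical example; the scores are invented purely for illustration and are not taken from the paper.

```python
# Toy illustration (made-up scores, not from the paper) of the softmax
# flattening described above: scores are positive and must sum to 1, so many
# mildly scored distractor tokens collectively soak up most of the attention.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([3.0] + [1.0] * 99)  # 1 relevant token, 99 distractors
attn = softmax(logits)
print(f"relevant token:       {attn[0]:.3f}")        # ~0.069
print(f"distractors combined: {attn[1:].sum():.3f}")  # ~0.931
```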


Differential Transformer​


Differential transformer
Differential Transformer (source: arXiv)

To address this limitation, the researchers developed the Diff Transformer, a new foundation architecture for LLMs. The core idea is to use a “differential attention” mechanism that cancels out noise and amplifies the attention given to the most relevant parts of the input.

The Transformer uses three vectors to compute attention: query, key, and value. The classic attention mechanism performs the softmax function on the entire query and key vectors.

The proposed differential attention works by partitioning the query and key vectors into two groups and computing two separate softmax attention maps. The difference between these two maps is then used as the attention score. This process eliminates common noise, encouraging the model to focus on information that is pertinent to the input.

The researchers compare their approach to noise-canceling headphones or differential amplifiers in electrical engineering, where the difference between two signals cancels out common-mode noise.
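
To make the mechanism concrete, here is a rough NumPy sketch of the subtraction idea, assuming fixed shapes and a fixed lambda weighting; it is an illustration, not the authors' implementation (in the paper the weighting is learned rather than fixed).

```python
# Rough sketch of the differential attention idea described above. The fixed
# lambda weighting, shapes, and variable names are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(q1, q2, k1, k2, v, lam=0.5):
    """Compute two softmax attention maps and use their difference to weight values."""
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))  # first attention map
    a2 = softmax(q2 @ k2.T / np.sqrt(d))  # second attention map
    attn = a1 - lam * a2                  # common-mode "noise" cancels in the difference
    return attn @ v

# Toy usage: split the query/key projections into two halves.
rng = np.random.default_rng(0)
seq, d = 8, 16
q = rng.normal(size=(seq, 2 * d))
k = rng.normal(size=(seq, 2 * d))
v = rng.normal(size=(seq, d))
q1, q2 = np.split(q, 2, axis=-1)
k1, k2 = np.split(k, 2, axis=-1)
print(diff_attention(q1, q2, k1, k2, v).shape)  # (8, 16)
```

Because both maps are computed over the same input, attention mass that both assign to irrelevant tokens tends to cancel in the subtraction, which is the common-mode-noise behavior the researchers liken to differential amplifiers.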

While Diff Transformer involves an additional subtraction operation compared to the classic Transformer, it maintains efficiency thanks to parallelization and optimization techniques.

“In the experimental setup, we matched the number of parameters and FLOPs with Transformers,” Wei said. “Because the basic operator is still softmax, it can also benefit from the widely used FlashAttention cuda kernels for acceleration.”

In retrospect, the method used in Diff Transformer seems like a simple and intuitive solution. Wei compares it to ResNet, a popular deep learning architecture that introduced “residual connections” to improve the training of very deep neural networks. Residual connections made a very simple change to the traditional architecture yet had a profound impact.

“In research, the key is to figure out ‘what is the right problem?’” Wei said. “Once we can ask the right question, the solution is often intuitive. Similar to ResNet, the residual connection is an addition, compared with the subtraction in Diff Transformer, so it wasn’t immediately apparent for researchers to propose the idea.”


Diff Transformer in action​


The researchers evaluated Diff Transformer on various language modeling tasks, scaling it up in terms of model size (from 3 billion to 13 billion parameters), training tokens, and context length (up to 64,000 tokens).

Their experiments showed that Diff Transformer consistently outperforms the classic Transformer architecture across different benchmarks. A 3-billion-parameter Diff Transformer trained on 1 trillion tokens showed consistent improvements of several percentage points compared to similarly sized Transformer models.

Further experiments with different model sizes and training dataset sizes confirmed the scalability of Diff Transformer. Their findings suggest that in general, Diff Transformer requires only around 65% of the model size or training tokens needed by a classic Transformer to achieve comparable performance.

Diff Transformer performance
The Diff Transformer is more efficient than the classic Transformer in terms of both parameters and training tokens (source: arXiv)

The researchers also found that Diff Transformer is particularly effective in using increasing context lengths. It showed significant improvements in key information retrieval, hallucination mitigation, and in-context learning.

While the initial results are promising, there’s still room for improvement. The research team is working on scaling Diff Transformer to larger model sizes and training datasets. They also plan to extend it to other modalities, including image, audio, video, and multimodal data.

The researchers have released the code for Diff Transformer, implemented with different attention and optimization mechanisms. They believe the architecture can help improve performance across various LLM applications.

“As the model can attend to relevant context more accurately, it is expected that these language models can better understand the context information with less in-context hallucinations,” Wei said. “For example, for the retrieval-augmented generation settings (such as Bing Chat, Perplexity, and customized models for specific domains or industries), the models can generate more accurate responses by conditioning on the retrieved documents.”
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,112
Reputation
8,239
Daps
157,800


Pika 1.5 updates again to add even more AI video Pikaffects: crumble, dissolve, deflate, ta-da​

Carl Franzen@carlfranzen

October 16, 2024 2:42 PM


Pikaffect dissolve sample image


Credit: Pika AI

Pika, a.k.a. Pika Labs or Pika AI, the Palo Alto, California-based startup that has raised $55 million to disrupt video production with its video AI models of the same name, is further expanding the free special effects users can access through its web-based AI image-to-video generator.

Pika 1.5, its latest AI video model, now includes the ability to crumble, dissolve, deflate and “ta-da” video subjects — the last of these essentially making a video subject disappear behind a cloth.

Users can simply upload an image to the site and Pika 1.5 will turn it into a video with a corresponding animation. The user guides which animation is used by selecting it from a button beside the “Image” attachment icon (paperclip) labeled “Pikaeffect” with a magic wand beside it.

Screenshot-2024-10-16-at-5.24.13%E2%80%AFPM-1.png


The new AI-powered special effects — or “Pikaffects,” in the company’s parlance — join six others unveiled earlier this month: explode, squish, melt, crush, inflate and “cake-ify,” the last of which turns any uploaded still image into an “is it cake?” video where the answer is a resounding “yes!”

Unfortunately, VentureBeat has been unable to try the new effects yet; when we attempted to, the site said: “We’re experiencing high demand right now (how flattering)!”

Nonetheless, as the AI landscape evolves, Pika’s unique approach to video manipulation sets it apart from the growing field of AI-driven content generation.

While Pikaffects cater to users seeking creative transformations, traditional features like lip-syncing and AI sound effects remain accessible on the earlier Pika 1.0 model. Paid subscribers have the flexibility to switch between Pika 1.5 and 1.0, depending on their project needs.


Where Pika came from​


Pika Labs, co-founded by former Stanford AI researchers Demi Guo and Chenlin Meng, first launched its AI video platform in late 2023. The company has rapidly scaled, reaching over half a million users in less than a year.

Unlike many AI video platforms that focus primarily on realism, Pika takes a different route by prioritizing creative manipulation.

These effects enable users to reshape video subjects in ways that are not just visually impactful but also technologically intriguing, offering hands-on AI practitioners a sandbox for experimentation.

For professionals managing machine learning models or integrating new AI tools, Pika Labs’ latest features could present new opportunities to deploy innovative content solutions.

The platform allows the quick application of effects through a user-friendly interface while still enabling deeper integration via text-to-video (T2V) and image-to-video (I2V) workflows.


Subscription pricing​


To accommodate a diverse range of users, Pika Labs offers four subscription plans:

  1. Basic (Free): This entry-level plan provides 150 monthly video credits and access to the Pika 1.5 features, making it suitable for casual users or those curious about the platform.
  2. Standard ($8/month, billed yearly): With 700 monthly credits, access to both Pika 1.5 and Pika 1.0, and faster generation times, this plan offers more flexibility for content creators looking to produce more videos.
  3. Pro ($28/month, billed yearly): This plan includes 2,000 monthly credits and even faster generation times, catering to users with higher content demands.
  4. Unlimited ($76/month, billed yearly): Designed for power users, this plan allows unlimited video credits, offering the fastest generation times available on the platform.

The updated credit structure (15 credits per five-second clip) allows for a scalable approach to video generation. The various subscription tiers accommodate different needs, from light experimentation to intensive production, ensuring that both individual contributors and larger teams can find an affordable solution.
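
As a quick worked example from the figures above, the monthly allowances translate roughly into the clip counts below; the snippet is plain arithmetic, not part of Pika's API.

```python
# Rough clips-per-month arithmetic from the published credit allowances;
# the Unlimited tier has no per-clip cap.
CREDITS_PER_CLIP = 15  # one five-second clip

plans = {"Basic": 150, "Standard": 700, "Pro": 2000}
for name, monthly_credits in plans.items():
    clips = monthly_credits // CREDITS_PER_CLIP
    print(f"{name}: ~{clips} five-second clips per month")
# Basic: ~10, Standard: ~46, Pro: ~133
```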

These flexible pricing options make Pika Labs accessible to smaller teams and larger organizations alike, allowing AI engineers to manage costs while experimenting with new video capabilities.


Attempting to differentiate amid a crowded sea of competitors​


The move by Pika to further differentiate its video AI model from competitors such as Runway, Luma, Kling, and Hailuo comes amid intensifying competition in the nascent industry, and follows Adobe’s move this week at its MAX conference in Miami Beach, Florida, to begin offering a preview of its own “enterprise safe” AI video model Firefly Video, trained on licensed data.

Pika, like most other generative AI startups, has not disclosed its precise training dataset. Rivals such as Runway have been sued by artists for alleged copyright infringement over training AI models on data scraped from the web, including artworks and videos, many of them likely copyrighted. That case, which also names AI image generator Midjourney and Stability AI, is moving toward trial but has yet to be decided.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,112
Reputation
8,239
Daps
157,800

Mistral AI’s new language models bring AI power to your phone and laptop​

Michael Nuñez@MichaelFNunez

October 16, 2024 10:47 AM


Credit: VentureBeat made with Midjourney


Credit: VentureBeat made with Midjourney

Mistral AI, a rising star in the artificial intelligence arena, launched two new language models on Wednesday, potentially reshaping how businesses and developers deploy AI technology.

The Paris-based startup’s new offerings, Ministral 3B and Ministral 8B, are designed to bring powerful AI capabilities to edge devices, marking a significant shift from the cloud-centric approach that has dominated the industry.

These compact models, collectively dubbed “les Ministraux,” are surprisingly capable despite their small size. Ministral 3B, with just 3 billion parameters, outperforms Mistral’s original 7 billion parameter model on most benchmarks. Its larger sibling, Ministral 8B, boasts performance rivaling models several times its size.

pretrain_table.png


Performance comparison of AI language models across various benchmarks. Mistral AI’s new Ministral 3B and 8B models (highlighted in bold) show competitive results against larger models from Google (Gemma) and Meta (Llama), particularly in knowledge, commonsense, and multilingual tasks. Higher scores indicate better performance. (Credit: Mistral)

Edge AI: Bringing intelligence closer to users​


The significance of this release extends far beyond technical specifications. By enabling AI to run efficiently on smartphones, laptops, and IoT devices, Mistral is opening doors to applications previously considered impractical due to connectivity or privacy constraints.

This shift towards edge computing could make advanced AI capabilities more accessible, bringing them closer to end-users and addressing privacy concerns associated with cloud-based solutions.

Consider a scenario where a factory robot needs to make split-second decisions based on visual input. Traditionally, this would require sending data to a cloud server for processing, introducing latency and potential security risks. With Ministral models, the AI can run directly on the robot, enabling real-time decision-making without external dependencies.

This edge-first approach also has profound implications for personal privacy. Running AI models locally on devices means sensitive data never leaves the user’s possession.

This could significantly impact applications in healthcare, finance, and other sectors where data privacy is paramount. It represents a fundamental shift in how we think about AI deployment, potentially alleviating concerns about data breaches and unauthorized access that have plagued cloud-based systems.

pretrain_with_gemma.png


Comparative performance of AI language models across key benchmarks. Mistral AI’s new Ministral 3B and 8B models (in orange) demonstrate competitive or superior accuracy compared to larger models from Google (Gemma) and Meta (Llama), particularly in multilingual capabilities and knowledge tasks. The chart illustrates the potential of more compact models to rival their larger counterparts. (Credit: Mistral)

Balancing efficiency and environmental impact​


Mistral’s timing aligns with growing concerns about AI’s environmental impact. Large language models typically require significant computational resources, contributing to increased energy consumption.

By offering more efficient alternatives, Mistral is positioning itself as an environmentally conscious choice in the AI market. This move aligns with a broader industry trend towards sustainable computing, potentially influencing how companies approach their AI strategies in the face of growing climate concerns.

The company’s business model is equally noteworthy. While making Ministral 8B available for research purposes, Mistral is offering both models through its cloud platform for commercial use.

This hybrid approach mirrors successful strategies in the open-source software world, fostering community engagement while maintaining revenue streams.

By nurturing a developer ecosystem around their models, Mistral is creating a robust foundation against larger competitors, a strategy that has proven effective for companies like Red Hat in the Linux space.

Navigating challenges in a competitive landscape​


The AI landscape is becoming increasingly crowded. Tech giants like Google and Meta have released their own compact models, while OpenAI continues to dominate headlines with its GPT series.

Mistral’s focus on edge computing could carve out a distinct niche in this competitive field. The company’s approach suggests a future where AI is not just a cloud-based service, but an integral part of every device, fundamentally changing how we interact with technology.

However, challenges remain. Deploying AI at the edge introduces new complexities in model management, version control, and security. Enterprises will need robust tooling and support to effectively manage a fleet of edge AI devices.

This shift could spawn an entirely new industry focused on edge AI management and security, similar to how the rise of cloud computing gave birth to a plethora of cloud management startups.

Mistral seems aware of these challenges. The company is positioning its new models as complementary to larger, cloud-based systems. This approach allows for flexible architectures where edge devices handle routine tasks, while more complex queries are routed to more powerful models in the cloud. It’s a pragmatic strategy that acknowledges the current limitations of edge computing while still pushing the boundaries of what’s possible.
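
A minimal sketch of that hybrid pattern might look like the following; every function name and the length-based routing heuristic are assumptions for illustration, not anything Mistral ships.

```python
# Hypothetical sketch of the hybrid edge/cloud routing described above.
# Function names and the crude length heuristic are illustrative assumptions.

def estimate_complexity(prompt: str) -> int:
    # Crude proxy: whitespace token count.
    return len(prompt.split())

def run_local_model(prompt: str) -> str:
    # Stand-in for an on-device, Ministral-class model.
    return f"[edge model] {prompt[:40]}"

def run_cloud_model(prompt: str) -> str:
    # Stand-in for a larger hosted model.
    return f"[cloud model] {prompt[:40]}"

def route_prompt(prompt: str, max_local_tokens: int = 512) -> str:
    """Handle routine prompts locally; send complex ones to the cloud."""
    if estimate_complexity(prompt) <= max_local_tokens:
        return run_local_model(prompt)
    return run_cloud_model(prompt)

print(route_prompt("Summarize this meeting note in two bullet points."))
```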

The technical innovations behind les Ministraux are equally impressive. Ministral 8B employs a novel “interleaved sliding-window attention” mechanism, allowing it to process long sequences of text more efficiently than traditional models.

Both models support context lengths of up to 128,000 tokens, translating to about 100 pages of text—a feature that could be particularly useful for document analysis and summarization tasks. These advancements represent a leap forward in making large language models more accessible and practical for everyday use.
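
For readers unfamiliar with the general idea, a sliding-window mask restricts each token to attending over a fixed number of recent tokens. The sketch below is a generic illustration with an assumed window size; Mistral's exact window and interleaving pattern are not described here.

```python
# Minimal sketch of a sliding-window causal attention mask, the general idea
# behind the mechanism named above. Window size is an assumption for illustration.
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Each position may attend to itself and the (window - 1) tokens before it."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        mask[i, max(0, i - window + 1):i + 1] = True
    return mask

print(sliding_window_mask(6, 3).astype(int))
```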

As businesses grapple with the implications of this technology, several key questions emerge. How will edge AI impact existing cloud infrastructure investments? What new applications will become possible with always-available, privacy-preserving AI? How will regulatory frameworks adapt to a world where AI processing is decentralized? The answers to these questions will likely shape the trajectory of the AI industry in the coming years.

Mistral’s release of compact, high-performing AI models signals more than just a technical evolution—it’s a bold reimagining of how AI will function in the very near future.

This move could disrupt traditional cloud-based AI infrastructures, forcing tech giants to rethink their dependence on centralized systems. The real question is: in a world where AI is everywhere, will the cloud still matter?
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,112
Reputation
8,239
Daps
157,800


Anthropic just made it harder for AI to go rogue with its updated safety policy​

Michael Nuñez@MichaelFNunez

October 15, 2024 12:16 PM

Credit: VentureBeat made with Midjourney


Credit: VentureBeat made with Midjourney

Anthropic, the artificial intelligence company behind the popular Claude chatbot, today announced a sweeping update to its Responsible Scaling Policy (RSP), aimed at mitigating the risks of highly capable AI systems.

The policy, originally introduced in 2023, has evolved with new protocols to ensure that AI models, as they grow more powerful, are developed and deployed safely.

This revised policy sets out specific Capability Thresholds—benchmarks that indicate when an AI model’s abilities have reached a point where additional safeguards are necessary.

The thresholds cover high-risk areas such as bioweapons creation and autonomous AI research, reflecting Anthropic’s commitment to prevent misuse of its technology. The update also brings more detailed responsibilities for the Responsible Scaling Officer, a role Anthropic will maintain to oversee compliance and ensure that the appropriate safeguards are in place.

Anthropic’s proactive approach signals a growing awareness within the AI industry of the need to balance rapid innovation with robust safety standards. With AI capabilities accelerating, the stakes have never been higher.


Why Anthropic’s Responsible Scaling Policy matters for AI risk management


Anthropic’s updated Responsible Scaling Policy arrives at a critical juncture for the AI industry, where the line between beneficial and harmful AI applications is becoming increasingly thin.

The company’s decision to formalize Capability Thresholds with corresponding Required Safeguards shows a clear intent to prevent AI models from causing large-scale harm, whether through malicious use or unintended consequences.

The policy’s focus on Chemical, Biological, Radiological, and Nuclear (CBRN) weapons and Autonomous AI Research and Development (AI R&D) highlights areas where frontier AI models could be exploited by bad actors or inadvertently accelerate dangerous advancements.

These thresholds act as early-warning systems, ensuring that once an AI model demonstrates risky capabilities, it triggers a higher level of scrutiny and safety measures before deployment.

This approach sets a new standard in AI governance, creating a framework that not only addresses today’s risks but also anticipates future threats as AI systems continue to evolve in both power and complexity.


How Anthropic’s policy could become a blueprint for the AI industry


Anthropic’s policy is more than an internal governance system—it’s designed to be a blueprint for the broader AI industry. The company hopes its policy will be “exportable,” meaning it could inspire other AI developers to adopt similar safety frameworks. By introducing AI Safety Levels (ASLs) modeled after the U.S. government’s biosafety standards, Anthropic is setting a precedent for how AI companies can systematically manage risk.

The tiered ASL system, which ranges from ASL-2 (current safety standards) to ASL-3 (stricter protections for riskier models), creates a structured approach to scaling AI development. For example, if a model shows signs of dangerous autonomous capabilities, it would automatically move to ASL-3, requiring more rigorous red-teaming (simulated adversarial testing) and third-party audits before it can be deployed.
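
To make the escalation rule concrete, a hypothetical sketch of such threshold checks might look like this; the capability names and safeguard lists are simplified placeholders, not Anthropic's actual policy text.

```python
# Hypothetical sketch of the tiered escalation logic described above.
# Capability names and safeguard lists are simplified placeholders.
CAPABILITY_THRESHOLDS = {"cbrn_uplift", "autonomous_ai_rnd"}
ASL3_SAFEGUARDS = {"red_teaming", "third_party_audit"}

def required_safety_level(triggered: set) -> str:
    """Crossing any capability threshold bumps the model from ASL-2 to ASL-3."""
    return "ASL-3" if triggered & CAPABILITY_THRESHOLDS else "ASL-2"

def can_deploy(triggered: set, safeguards_in_place: set) -> bool:
    if required_safety_level(triggered) == "ASL-3":
        return ASL3_SAFEGUARDS <= safeguards_in_place
    return True

print(required_safety_level({"autonomous_ai_rnd"}))        # ASL-3
print(can_deploy({"autonomous_ai_rnd"}, {"red_teaming"}))  # False: audit missing
```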

If adopted industry-wide, this system could create what Anthropic has called a “race to the top” for AI safety, where companies compete not only on the performance of their models but also on the strength of their safeguards. This could be transformative for an industry that has so far been reluctant to self-regulate at this level of detail.


The role of the responsible scaling officer in AI risk governance


A key feature of Anthropic’s updated policy is the expanded responsibilities of the Responsible Scaling Officer (RSO)—a role that Anthropic will continue to maintain from the original version of the policy. The updated policy now details the RSO’s duties, which include overseeing the company’s AI safety protocols, evaluating when AI models cross Capability Thresholds, and reviewing decisions on model deployment.

This internal governance mechanism adds another layer of accountability to Anthropic’s operations, ensuring that the company’s safety commitments are not just theoretical but actively enforced. The RSO has the authority to pause AI training or deployment if the safeguards required at ASL-3 or higher are not in place.

In an industry moving at breakneck speed, this level of oversight could become a model for other AI companies, particularly those working on frontier AI systems with the potential to cause significant harm if misused.


Why Anthropic’s policy update is a timely response to growing AI regulation


Anthropic’s updated policy comes at a time when the AI industry is under increasing pressure from regulators and policymakers. Governments across the U.S. and Europe are debating how to regulate powerful AI systems, and companies like Anthropic are being watched closely for their role in shaping the future of AI governance.

The Capability Thresholds introduced in this policy could serve as a prototype for future government regulations, offering a clear framework for when AI models should be subject to stricter controls. By committing to public disclosures of Capability Reports and Safeguard Assessments, Anthropic is positioning itself as a leader in AI transparency—an issue that many critics of the industry have highlighted as lacking.

This willingness to share internal safety practices could help bridge the gap between AI developers and regulators, providing a roadmap for what responsible AI governance could look like at scale.


Looking ahead: What Anthropic’s Responsible Scaling Policy means for the future of AI development


As AI models become more powerful, the risks they pose will inevitably grow. Anthropic’s updated Responsible Scaling Policy is a forward-looking response to these risks, creating a dynamic framework that can evolve alongside AI technology. The company’s focus on iterative safety measures—with regular updates to its Capability Thresholds and Safeguards—ensures that it can adapt to new challenges as they arise.

While the policy is currently specific to Anthropic, its broader implications for the AI industry are clear. As more companies follow suit, we could see the emergence of a new standard for AI safety, one that balances innovation with the need for rigorous risk management.

In the end, Anthropic’s Responsible Scaling Policy is not just about preventing catastrophe—it’s about ensuring that AI can fulfill its promise of transforming industries and improving lives without leaving destruction in its wake.
 