bnew

Veteran
Joined
Nov 1, 2015
Messages
54,107
Reputation
8,072
Daps
153,693



1/6
@omarsar0
Mistral AI is doubling down on small language models.

Their latest Ministral models (both the 3B and 8B) are pretty impressive and will be incredibly useful for a lot of LLM workflows.

Some observations:

I enjoy seeing how committed Mistral AI is to developing smaller and more capable models.

They seem to understand what developers want and need today.

There is huge competition for the finest, smallest, and cheapest models. This is good for the AI developer community.

This sets up the community really well in terms of the wave of innovation that’s coming around on-device AI and agentic workflows. 2025 is going to be a wild year.

They don’t mention the secret sauce behind these capable smaller models (probably some distillation happening), but the Ministral 3B model already performs competitively with Mistral 7B. I think this is a great focus for Mistral as they seek to differentiate from other LLM providers.

Given this announcement, I am now super curious about what the next Gemma and Llama small models are going to bring. Mini models are taking over!

I use small models for processing data, structuring information, function calling, routing, evaluation pipelines, prompt chaining, agentic workflows, and a whole lot more.



GaCOZmZXYAAi9xC.jpg


2/6
@omarsar0
More thoughts here: https://invidious.poast.org/dh1i3YAwd1I



3/6
@thomvlieshout
For what do you use small models?



4/6
@omarsar0
I mentioned a few ways I use them towards the end of the first post.



5/6
@M_Kasinski
What do you use as a small model for function calling? For my part, nothing suits me if you have to call a suite of functions.



6/6
@remotelama
refreshing pivot to efficient, compact models! interpretive yet concise approaches open new possibilities. impressive versatility.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 



1/5
@omarsar0
This is an interesting new research paper from OpenAI evaluating the fairness of ChatGPT.

I recorded a 10 min overview of it if anyone is interested: https://invidious.poast.org/Cb81ERsB-KI

[Quoted tweet]
We’re releasing a new study that explores how users’ names can impact ChatGPT’s responses. openai.com/index/evaluating-…


GZ9hxSJX0AANXMu.jpg


2/5
@omarsar0
One aspect of these types of studies that I like is that you can gain better intuition as to how these LLMs understand and perform tasks for different domains and problems.

For instance, OpenAI researchers found that open-ended tasks (entertainment and art) with longer responses were more likely to include a harmful stereotype. This raises the question of whether shorter responses are actually better in terms of safety and accuracy. A recent paper highlighted the same. The other question is how the temperature value influences this.

The instruction that was used with GPT-4o to analyze patterns across ChatGPT conversations is interesting as well. It can be used for other applications beyond safety such as building verification or fact-checkers. We have been doing something similar at @dair_ai using GPT-4o-mini.

I talk more about these details in my video.



3/5
@gpt_biz
Sounds like a valuable resource for understanding AI fairness better, thanks for sharing!



4/5
@bate5a55
Interesting that GPT-4o is evaluating its own outputs—using the same model for assessment can introduce bias feedback loops that many aren’t aware of in AI fairness studies.



5/5
@HenrikMolgard
This is one of the most important papers I have seen in a long time. We need to recognize which biases we are feeding into the AIs. If they go unnoticed they will only grow stronger. I made a deep dive on the paper. Find it here:

[Quoted tweet]
Exploring fairness in #AI chatbots: How do names and biases impact user experience? Learn about the findings from an #OpenAI paper in our latest #podcast on 'First-Person Fairness in Chatbots.' 🤖⚖️

Listen here: invidious.poast.org/Rnugrj6vL5Q

#AI #Bias #MythosAI


GaCJ7X-W0AE2c0g.jpg





1/10
@omarsar0
Thinking LLMs

How difficult is it to train LLMs to do explicit "thinking" before responding to questions or tasks?

This work proposes a training method to equip LLMs with thinking abilities for general instruction-following without human-annotated data.

It uses an iterative search and optimization procedure to explore thought generation which enables the model to learn without direct supervision.

Thought candidates for each user instruction are scored with a judge model. Note that only the responses are evaluated by the Judge which determines the best and worst ones.

Then the corresponding full outputs are used as chosen and rejected pairs for DPO (referred to as Thought Preference Optimization in this paper). This entails the full training process that involves multiple iterations.

Overall, this is a simple yet very effective approach to incentivizing the model to generate its own thoughts without explicitly teaching it how to think. The authors also find that these Thinking LLMs are effective even in problems that often don't rely on reasoning or CoT methods.
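The pair-construction step described in this post can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's code: `generate` and `judge_score` are hypothetical stand-ins for sampling from the model and calling the judge model, and the `Response:` delimiter is an assumed output format.

```python
from typing import Callable, List, Tuple

RESPONSE_MARKER = "Response:"  # assumed delimiter between thought and response


def split_output(full_output: str) -> Tuple[str, str]:
    """Split one sampled output into (thought, response) at the assumed marker."""
    thought, _, response = full_output.partition(RESPONSE_MARKER)
    return thought.strip(), response.strip()


def build_preference_pair(
    instruction: str,
    generate: Callable[[str], str],            # hypothetical: samples thought + response
    judge_score: Callable[[str, str], float],  # hypothetical: scores a response only
    num_samples: int = 4,
) -> Tuple[str, str]:
    """Return (chosen, rejected) full outputs for one instruction."""
    outputs: List[str] = [generate(instruction) for _ in range(num_samples)]
    # The judge is shown only the response part, never the thought.
    scores = [judge_score(instruction, split_output(o)[1]) for o in outputs]
    best = max(range(num_samples), key=lambda i: scores[i])
    worst = min(range(num_samples), key=lambda i: scores[i])
    # The full outputs (thought included) become the chosen/rejected pair.
    return outputs[best], outputs[worst]
```

The resulting pairs would then be handed to a DPO trainer, and the whole sample-judge-train loop repeated for several iterations.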



GZ8eZQ8b0AIjtkp.jpg


2/10
@omarsar0
Paper: [2410.10630] Thinking LLMs: General Instruction Following with Thought Generation



3/10
@ninataft
AI's new thinking cap: Now with less coffee, more thought bubbles, and zero existential crises! 🧠😆



4/10
@BensenHsu
The paper focuses on training large language models (LLMs) to have the ability to "think" before providing a response, similar to how humans think before answering a complex question. This is called "Thinking LLMs". The authors argue that thinking can be beneficial for a wide range of tasks, not just logical or math-related ones.

The results show that the initial seed model performs poorly when asked to generate thoughts, compared to the direct response baseline. However, after multiple iterations of TPO training, the Thinking LLM outperforms the direct response baseline on general instruction-following benchmarks like AlpacaEval and Arena-Hard. The authors also find that thinking benefits not only reasoning and problem-solving tasks, but also categories like marketing, health, and general knowledge.

full paper: THINKING LLMS: GENERAL INSTRUCTION FOLLOWING WITH THOUGHT GENERATION



GZ8vii0aYAALKlu.jpg


5/10
@LaurenceBrem
From the abstract, it still feels like inference is happening in some form before responding. Still impressive!



6/10
@bytetweets
It’s not thinking, it is reasoning.



7/10
@Anonymo74027124
Interesting concept. Curious to see the results of this training method. Will keep an eye on its progress.



8/10
@jackshiels
It’s actually very interesting how so many researchers missed this. The need for compute probably left many with an implicit assumption that ‘compute is bad’, meaning it was overlooked.



9/10
@bate5a55
Interesting that the 'Judge' component evaluates only the responses, allowing the LLM to generate unfiltered thoughts. This might enable emergent metacognition, a capability rarely seen in AI models.



10/10
@gpt_biz
This sounds like an exciting development in AI research! It's impressive to see models being trained to think more independently without needing human annotations. Looking forward to the future possibilities!






1/9
@omarsar0
This is the first report I've seen where out-of-the-box o1 CoT reasoning is significantly outperformed on code generation.

They propose AlphaCodium which acts as a strategy provider (i.e., flow engineering) on top of o1 to fuel and guide o1's chain-of-thought capabilities.

AlphaCodium is built as a multi-stage flow improving on reasoning and reliability, while significantly elevating o1-preview's performance (55% --> 78% pass@5 accuracy on the Codeforces benchmark).

Author's quote: "The promise of AlphaCodium is clear: with the right strategic flow-engineering, foundational models like o1 can be nudged toward System II thinking. We still have work to do to cross the gap from “System 1.5” to a true System 2-level AI, but when observing tools like AlphaCodium, we can better understand the gap and keep researching to get closer."

o1 is a significantly better model than most LLMs out there today. However, it can benefit from strategic guidance as shown in this report. I am also working on a similar approach for knowledge-intensive tasks. From my own analysis, it does feel like o1 has better knowledge of complex tasks but still shows limitations in complex knowledge understanding and reasoning. More on this soon.



GZ3Ysikb0AIXfPf.jpg


2/9
@omarsar0
Details here: AlphaCodium Outperforms Direct Prompting of o1 Model



3/9
@kubertai
Hmm, I'm testing it with a simple use case to fix a unit test, and it keeps giving incorrect suggestions. Here, the suggestion is incorrect, breaks the working code and does not update the test. Maybe it's a user error. I will give it some more time.



GZ7NLTFWMAA5GTD.jpg


4/9
@itamar_mar
I've recorded a 5-minute explanation:
x.com

Great summary @omarsar0 , thank you ♥️

[Quoted tweet]
Introducing AlphaCodium-o1 ✨
Achieves state-of-the-art results on the Codeforces Code Contest benchmark, outperforming o1-ioi !

Recently, @gdb suggested that o1 unlocks System II thinking.
However, I disagree and propose that o1 operates at a System 1.5 level of thinking

1/


5/9
@TaNGSoFT
As I said after watching Chollet's speech, which can be regarded as clearing my doubts about the boundary of current LLM intelligence:

The attention mechanism realized by the Transformer architecture is the process of the digital neural network for the text. The abstraction he described here is similar to the “compress” described by Ilya. This process is value-centric. Continuous, value-centric abstraction is suitable for intention understanding, knowledge retrieval, and intuitive persuasion. It is especially suitable for conversational applications, so the abstraction of next-word prediction is fine;

but reasoning needs the program-centric abstraction of type 2. Program-centric abstraction is the abstraction of discrete thinking. It seems that another layer of logical structure needs to be built on top of the text. This structure is very certain, which is why the word "program" is used. The attention of the transformer architecture alone is not enough, and a different breakthrough is needed.

While here, Alphacodium did things like the program structure.



6/9
@feulf
not a fan of alphacodium, I've tried it before, it's pretty much a copy of cursor, but it asks for access to your full codebase, whereas (according to the founders) cursor only accesses your codebase locally and pushes the embeddings + a merkle tree of the changes to cache it.



7/9
@TheLobbyistGuy
It adds in a planning phase or two before generating the code.

Anyone who has used LLMs to do much coding knows that this improves performance a great deal.



8/9
@bate5a55
Interesting to see AlphaCodium's "Problem Reflection" phase—an uncommon feature in AI models. By integrating self-reflection, it's mimicking expert programmers' introspection, which might explain its edge over o1's CoT.



9/9
@gpt_biz
This is exciting! AlphaCodium sounds like a game changer in improving o1's reasoning and performance















1/12
@itamar_mar
Introducing AlphaCodium-o1 ✨
Achieves state-of-the-art results on the Codeforces Code Contest benchmark, outperforming o1-ioi !

Recently, @gdb suggested that o1 unlocks System II thinking.
However, I disagree and propose that o1 operates at a System 1.5 level of thinking

1/



https://video.twimg.com/ext_tw_video/1846101570376630272/pu/vid/avc1/1152x720/89VrSUNLE2cOxwYl.mp4

2/12
@itamar_mar
2/
‣System 1 – quick inference
‣System 1.5 – guided chain-of-thought
‣System 2 – deep, deliberate reasoning reinforced by information arriving from validation processes, using & acquiring relevant thinking frameworks and tools, including devising & choosing between options



3/12
@itamar_mar
3/ AlphaCodium is an open-source research tool that boosts the performance of various leading models, including o1-preview

It employs a code-oriented, multi-stage flow that emphasizes continuous improvement through iteration

Designed by the team at @QodoAI



GZ4OwzYWEAAsFmy.jpg

GZ4PClBWgAA7z5c.jpg


4/12
@itamar_mar
4/ If o1 was already exhibiting System 2 thinking, I claim it would have gained less from being wrapped up with AlphaCodium

More in the video above and in the blog: AlphaCodium Outperforms Direct Prompting of o1 Model

By the way - does System II thinking mean AGI?



5/12
@beyondthisjourn
Sam Altman said that o1 is only at the GPT-2 level of reasoning. Until we get to the GPT-4 level, it's not yet System 2.



6/12
@itamar_mar
Yea, saw that.
Makes sense to me.

Here is another similar quote by him

[Quoted tweet]
here is o1, a series of our most capable and aligned models yet:

openai.com/index/learning-to…

o1 is still flawed, still limited, and it still seems more impressive on first use than it does after you spend more time with it.


GXStyAIW8AEQDka.jpg


7/12
@DavidBensh
We don't need more nomenclature 😅
Neuroscience actually distinguishes system 2 vs 1 mainly as Semantic (and therefore conscious) vs Associative (therefore automatic and unconscious).
I want to see either semantic references or consciousness before this question becomes relevant.



8/12
@itamar_mar
Interesting.

What would a conscious model/AI-system look like?



9/12
@SiVola
Congrats! Looks very promising. Can we use it in Cursor?



10/12
@itamar_mar
AlphaCodium is a research tool introduced by @QodoAI.

At Qodo, we develop various tools, including Qodo Gen which is an IDE extension that can be installed in Cursor.
However, AlphaCodium itself isn't directly implemented in Qodo Gen, rather the learnings are continuously being integrated.

In the future, AlphaCodium will become a flow that will work for real-world software (and then we aim to call it Qodo Flow; the UX/UI is still not disclosed)



11/12
@roybenjos
Amazing! Good luck!



12/12
@itamar_mar
Good luck to... the future of software development. Exciting times ahead of us

Right now AI-empowered tools like AlphaCodium can do pretty well (93rd+ percentile) on code contests. In the future, this will translate to agents that can handle end-to-end sub-tasks on real world software









1/11
@svpino
Another step closer to having AI write code better than humans!

The new release of AlphaCodium, an open-source state-of-the-art code generation tool, outperforms directly prompting OpenAI when generating code.

This is a huge deal. The research team @QodoAI tested this on the Codeforces Code Contest benchmark, and the leap is huge:

Using o1-preview

• Direct prompting: 55%
• AlphaCodium: 78%

Using o1-mini

• Direct prompting: 53%
• AlphaCodium: 74%

These results make AlphaCodium the best approach to generate code we've seen so far.

I'm linking to a blog post with more information, the paper, and the GitHub repository below, but here is a 30-second summary of how AlphaCodium works:

AlphaCodium relies on an iterative process that repeatedly runs and fixes the generated code using the testing data.

1. The first step is to have the model reason about the problem. They describe it using bullet points and focus on the goal, inputs, outputs, rules, constraints, and any other relevant details.

2. Then, they make the model reason about the public tests and come up with an explanation of why the input leads to that particular output.

3. The model generates two to three potential solutions in text and ranks them in terms of correctness, simplicity, and robustness.

4. Then, it generates more diverse tests for the problem, covering cases not part of the original public tests.

5. Iteratively, pick a solution, generate the code, and run it on a few test cases. If the tests fail, improve the code and repeat the process until the code passes every test.
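The run-and-fix loop in step 5 can be sketched as below. This is a toy illustration, not AlphaCodium's implementation: solutions are modelled as plain Python callables, and `repair` is a hypothetical stand-in for the LLM call that rewrites the code given the failing tests.

```python
from typing import Any, Callable, List, Tuple

TestCase = Tuple[Any, Any]          # (input, expected output)
Solution = Callable[[Any], Any]


def failing_tests(solution: Solution, tests: List[TestCase]) -> List[TestCase]:
    """Run the solution on every test and return the cases it gets wrong."""
    failures = []
    for given, expected in tests:
        try:
            ok = solution(given) == expected
        except Exception:
            ok = False  # a crash counts as a failure
        if not ok:
            failures.append((given, expected))
    return failures


def iterate_until_pass(
    solution: Solution,
    tests: List[TestCase],
    repair: Callable[[Solution, List[TestCase]], Solution],  # hypothetical LLM fix step
    max_rounds: int = 5,
) -> Tuple[Solution, bool]:
    """Repeatedly test and repair; return the last solution and whether it passes."""
    for _ in range(max_rounds):
        failures = failing_tests(solution, tests)
        if not failures:
            return solution, True
        solution = repair(solution, failures)
    return solution, not failing_tests(solution, tests)
```

Capping the number of rounds is a design choice made here for the sketch; it keeps the loop from running forever when the repair step never converges.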

There's a lot more information in the paper and the blog post. Here are the links:

• Blog: AlphaCodium Outperforms Direct Prompting of o1 Model
• Paper: [2401.08500] Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering
• Code: GitHub - Codium-ai/AlphaCodium: Official implementation for the paper: "Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering"

I attached an image comparing AlphaCodium with direct prompting using different models.



GZ8GWFmbwAAQAuH.jpg


2/11
@svpino
Take a look at this video for a bit more information about this:

[Quoted tweet]
Introducing AlphaCodium-o1 ✨
Achieves state-of-the-art results on the Codeforces Code Contest benchmark, outperforming o1-ioi !

Recently, @gdb suggested that o1 unlocks System II thinking.
However, I disagree and propose that o1 operates at a System 1.5 level of thinking

1/


https://video.twimg.com/ext_tw_video/1846101570376630272/pu/vid/avc1/1152x720/89VrSUNLE2cOxwYl.mp4

3/11
@gpt_biz
This is exciting news about AlphaCodium's impressive advancements in code generation—definitely worth exploring further for anyone interested in AI and programming!



4/11
@MarkusOdenthal
Nice, I know what I will do today.

But seems like another tool. So again switching? 🤔



5/11
@TonyOConnell
I don’t think I’m too special but about 85 percent of my inference results in the code I want. I have good prompts that make AI code much better than this human anyway.



6/11
@AlirezaShabaniV
It is great to see that this is open source. Thanks for sharing this.



7/11
@simform
For tech startups and enterprises looking to scale their development process, tools like AlphaCodium should be at the top of your list.

The iterative testing and fixing process ensures that your generated code is not just functional but also resilient. This could reduce your debugging time drastically while improving code performance. Definitely worth integrating into CI/CD pipelines.



8/11
@Sai_2_7_9
We have trained them, and we should not forget that. Yes, due to multiple training runs, and because it doesn't have any emotion or fear, it may write better. But we are the owners of AI.



9/11
@tariusdamon
anyone got a graph showing coding capabilities over time? I’m evaluating whether to defer a big refactor for the inevitable.

such a weird time to know how to code



10/11
@digitalhealthxx
Impressive progress with AlphaCodium! The iterative approach to code generation, combined with reasoning about problem constraints and generating diverse tests, seems like a significant step forward. Achieving such a leap in accuracy on the Codeforces benchmark demonstrates the potential for automated code generation tools to truly enhance developer productivity. Exciting to see open-source solutions like this pushing the boundaries of what's possible in AI-driven coding!



11/11
@LeeLeepenkman
Is this going to be the same for people writing code?

We should probably all have more tests :D



GZ_OZjrb0AAJ2_H.png




1/3
@llama_index
Instead of finetuning your LLMs, try dynamic few-shot prompting instead 💡

With dynamic few-shot prompting, instead of injecting a fixed set of examples into the prompt, you retrieve a dynamic set of examples based on the query, so you find examples that are relevant to solving your input task.

This is helpful for use cases like customer support, text-to-SQL, structured output, and more.

@clusteredbytes has a great resource repo showing how this works using @llama_index workflows, check it out: GitHub - rsrohan99/dynamic-few-shot-llamaindex-workflow

For more details on workflows: Workflows - LlamaIndex

[Quoted tweet]
Dynamic Few-shot prompting is an alternative to finetuning LLMs where:

- we can get consistent output for many use-cases
- rapid iteration
- takes effect right away
- easy to implement

Here's how to implement Dynamic Few-shot prompting using @llama_index workflows 👇


GZ-Okf3b0AAXdc6.jpg


https://video.twimg.com/amplify_video/1846251770000908288/vid/avc1/1280x720/MWdPaPLA8U-dnRgE.mp4

2/3
@ben_burtenshaw
Nice overview. If the LLM is relying on just a few examples, it's useful to also manually review them.

[Quoted tweet]
You can now connect your llama index application to Argilla to collect feedback on the app.


GZnYex_XwAYFGZ4.jpg


3/3
@b05crypto
So, here's a problem I would love to hear experts weigh in on.

You have a programming environment/language with at least 250 functions. You want your AI to be able to spit out the JSON required for the programming environment to use the functions and build an app with a UI also detailed in the JSON.

You want it to consistently make the right choices about the functions/blocks it chooses.

What is the best solution?

RAG?
Fine-tuning?
Better prompting?









1/5
@clusteredbytes
Dynamic Few-shot prompting is an alternative to finetuning LLMs where:

- we can get consistent output for many use-cases
- rapid iteration
- takes effect right away
- easy to implement

Here's how to implement Dynamic Few-shot prompting using @llama_index workflows 👇



https://video.twimg.com/amplify_video/1846251770000908288/vid/avc1/1280x720/MWdPaPLA8U-dnRgE.mp4

2/5
@clusteredbytes
Instead of putting a static list of examples in the prompt, we dynamically pull the best examples from our database based on user query.

GitHub repo: GitHub - rsrohan99/dynamic-few-shot-llamaindex-workflow
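A rough sketch of the idea, assuming nothing about the linked repo: the example store, the `route:` labels, and the bag-of-words cosine similarity below are all illustrative choices made for this sketch. A real setup would more likely use an embedding model and a vector index for retrieval.

```python
import math
import re
from collections import Counter
from typing import Dict, List

# Hypothetical example store; in practice this would live in a database or index.
EXAMPLES: List[Dict[str, str]] = [
    {"input": "How many orders were placed last month?", "output": "route: analytics"},
    {"input": "I want a refund for my order", "output": "route: billing"},
    {"input": "My password reset email never arrived", "output": "route: account"},
    {"input": "Cancel my subscription", "output": "route: billing"},
]


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_examples(query: str, examples: List[Dict[str, str]], k: int = 2) -> List[Dict[str, str]]:
    """Pick the k stored examples most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(examples, key=lambda ex: cosine(q, bag_of_words(ex["input"])), reverse=True)
    return ranked[:k]


def build_prompt(query: str, examples: List[Dict[str, str]], k: int = 2) -> str:
    """Assemble a few-shot prompt from the retrieved examples plus the new query."""
    shots = [f"Input: {ex['input']}\nOutput: {ex['output']}\n" for ex in select_examples(query, examples, k)]
    return "\n".join(shots + [f"Input: {query}\nOutput:"])
```

Calling `build_prompt("I never got the password reset email", EXAMPLES)` places the password-reset example first, because it shares the most tokens with the query.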



3/5
@jessphilips5
Tried something similar before:
Ollama(codegemma + Nomic), chroma vec, oracle dB, streamlit with langchain chat history frontend. Mixed results, not good enough for production. Will try the above 👍



4/5
@AG_AItech
I'm excited to try out dynamic few-shot prompting with llama_index workflows. The potential for rapid iteration and consistent output is huge. Great job sharing this idea!



5/5
@xDavidHuang
How is this different from DSPy?






1/4
@rohanpaul_ai
Transformer and Human Brain.

Discussion by Andrej Karpathy

-------

Video Credit - Original video from "No Priors: AI, Machine Learning, Tech, & Startups" YouTube Channel (Link in comment)



2/4
@rohanpaul_ai
Link to Original video from "No Priors: AI, Machine Learning, Tech, & Startups" YouTube Channel

https://invidious.poast.org/watch?v=hM_h0UA7upI



3/4
@SashaKrecinic
The part the brain is good at is knowing what information to disregard. RL with transformers is the next paradigm. Then we will have the best of both worlds



4/4
@GuardedCode
Intriguing discussion by Andrej Karpathy on Transformer and Human Brain. I've often wondered about AI's potential to enhance human cognition - does it complement or replace our abilities?





1/3
@youraimarketer
This is a turning point in AI agent adoption.

By keeping everything on-device, it cuts out the lag from cloud-based processing and adds a serious layer of privacy protection.

No need to send sensitive data anywhere.

Congrats @Lenovo!

[Quoted tweet]
Announced today — Lenovo AI Now is a new on-device AI agent, built with Llama 3.1 to enable capabilities ranging from document management and summarization, to device control and content generation.

Exciting to see this kind of AI innovation from the team at @Lenovo. 🦙


2/3
@MikeBirdTech
Keep pushing intelligence to the edge!



3/3
@Ed_AgentDev
This alignment of on-device AI and data privacy is music to my ears. Many of us in AI development have been awaiting a solution like this. Kudos to the Lenovo team for pushing the boundaries.








1/10
@AIatMeta
Announced today — Lenovo AI Now is a new on-device AI agent, built with Llama 3.1 to enable capabilities ranging from document management and summarization, to device control and content generation.

Exciting to see this kind of AI innovation from the team at @Lenovo. 🦙

[Quoted tweet]
The advances we demo at #LenovoTechWorld aren’t just technical feats. They represent a strategic vision for the future of computing.

This year, we showed off a world where AI is personalized, productive, protected, and for all. lnv.gy/3Nu5S2A


GZ9zON0b0AA6Mz2.jpg


2/10
@CohorteAI
That’s a big move from Lenovo! The new Lenovo AI Now agent, built with Llama 3.1, brings exciting on-device AI capabilities like document management, summarization, device control, and content generation. It’s great to see this level of AI innovation from Lenovo, definitely a step toward more powerful and integrated AI experiences! #AI #Lenovo #Llama3.1 #Innovation



3/10
@bjornfix
I'm very happy to see it also runs Linux, not just Windows. Definitely interested in looking into this when it becomes available here in Egypt.



4/10
@cognitivetech_
that's so fire, and what the space has needed. not a central server but a local ai assistant, not a permanent spy device communicating to amazon constantly



5/10
@koltregaskes
This looks fancy.



6/10
@ac_crypto
can it run @exolabs?



7/10
@koltregaskes
But can it run Crysis?



8/10
@koltregaskes
An AI "agent" though?



9/10
@SkillsGapTrain
Great thinking, Meta and Lenovo!

What we truly need is a rugged, waterproof, field AI device that we can take on expeditions to the Rocky Mountains.

It would allow us to perform off-grid, on-device AI processing and accept a USB database of millions of PDF documents to load in for analysis, and one can then design future inventions, right there in the wilderness, on the travels, or solve engineering challenges along the way such as burned out transformers or damaged Tesla vehicles.

This would significantly enhance our STEM capabilities in the field, enabling innovation, enhanced capability, upgraded human function, and problem-solving in real-time, no matter the environment or challenge.



GZ_6hXhWEAAbvQt.jpg

GZ_6hUlXsAAHqDW.jpg

GZ_6hVRXoAEKNuy.jpg


10/10
@gpt_biz
This AI agent sounds super helpful for streamlining tasks and boosting productivity, can't wait to try it out!






1/4
@rohanpaul_ai
Deep Learning Cheatsheet



GaBWnNuXkAAQ6iC.jpg


2/4
@rohanpaul_ai




GaBWwcPWEAAuTbp.jpg


3/4
@osalasmo
This image offers a concise overview of neural networks, covering key topics such as neural network architecture, activation functions (including Sigmoid, Tanh, ReLU, and Leaky ReLU), learning rate, backpropagation, weight updating, dropout, and convolutional neural networks. It also includes mathematical formulas for concepts like cross-entropy loss and batch normalization. The cheatsheet provides visual diagrams of a neural network structure and graphs of different activation functions, as well as detailed steps for the weight updating process in a neural network. This summary is designed to be a quick and comprehensive reference for students and professionals in the field of deep learning.
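For readers without the image, the four activation functions named in this summary can be written out as below. This is a plain-Python sketch using their standard definitions; the Leaky ReLU slope of 0.01 is a common default assumed here, not a value taken from the cheatsheet.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def tanh(x: float) -> float:
    """Hyperbolic tangent; squashes inputs into (-1, 1)."""
    return math.tanh(x)


def relu(x: float) -> float:
    """Rectified linear unit: passes positives, zeroes out negatives."""
    return max(0.0, x)


def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    """Like ReLU, but keeps a small slope for negative inputs (slope is an assumed default)."""
    return x if x > 0 else negative_slope * x
```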



4/4
@osalasmo
(Translated from Spanish) The image offers a concise overview of neural networks, covering key topics such as neural network architecture, activation functions (including Sigmoid, Tanh, ReLU, and Leaky ReLU), learning rate, backpropagation, weight updating, dropout, and convolutional neural networks. It also includes mathematical formulas for concepts like cross-entropy loss and batch normalization. The cheatsheet provides visual diagrams of a neural network structure and graphs of different activation functions, as well as detailed steps for the weight updating process in a neural network. This summary is designed to be a quick and comprehensive reference for students and professionals in the field of deep learning.








1/4
@rohanpaul_ai
Switch Sparse Autoencoders (Switch SAEs) offer compute-efficient scaling for sparse autoencoders in LLM feature extraction.

**Original Problem** 🔍:

Scaling sparse autoencoders (SAEs) for decomposing neural network activations into interpretable features is computationally expensive, limiting their application to frontier language models.

-----

**Solution in this Paper** 🛠️:

• Introduces Switch Sparse Autoencoders (Switch SAEs)
• Combines Switch layer with TopK SAE
• Uses multiple expert SAEs and a routing network
• Routes input activations to the most probable expert
• Avoids dense matrix multiplication in encoder
• Trains router and expert SAEs simultaneously
• Balances reconstruction and expert utilization

-----

**Key Insights from this Paper** 💡:

• Switch SAEs improve FLOPs vs. training loss over existing methods
• Feature duplication across experts reduces SAE capacity
• Encoder features cluster by expert, while decoder features are more diffuse
• Switch SAEs show bias towards duplicate frequent features

-----

**Results** 📊:

• Switch SAEs achieve better reconstruction than dense SAEs at fixed compute budget
• FLOP-matched Switch SAEs Pareto-dominate TopK, Gated, and ReLU SAEs
• Width-matched Switch SAEs perform slightly worse than TopK SAEs but outperform ReLU SAEs
• Switch SAEs reduce FLOPs per activation by up to 128x while maintaining ReLU SAE performance
• Feature interpretability remains similar to TopK SAEs



GaDelhsWMAAQ8C6.png


2/4
@rohanpaul_ai
🧠 How does the Switch SAE architecture work?

The Switch SAE consists of multiple "expert" SAEs and a routing network. The router determines which expert SAE should process a given input activation vector. Each expert SAE resembles a TopK SAE without a bias term. This approach leverages conditional computation to avoid dense matrix multiplication in the encoder, reducing computational costs.
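A toy forward pass may make the routing concrete. This is a sketch under assumptions, not the paper's code: it uses plain Python lists instead of a tensor library, omits all bias terms, and scales the expert output by the router probability in the style of a Switch layer, which is assumed here rather than confirmed by the thread.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(values: Vector, k: int) -> Vector:
    """Keep the k largest entries and zero the rest."""
    keep = set(sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]


class ExpertSAE:
    """One expert: a TopK-style autoencoder without bias terms."""

    def __init__(self, encoder: Matrix, decoder: Matrix, k: int) -> None:
        self.encoder, self.decoder, self.k = encoder, decoder, k

    def forward(self, x: Vector) -> Vector:
        latents = top_k(matvec(self.encoder, x), self.k)
        return matvec(self.decoder, latents)


def switch_sae_forward(x: Vector, router: Matrix, experts: List[ExpertSAE]) -> Tuple[int, Vector]:
    """Route x to the most probable expert and return (expert index, reconstruction)."""
    probs = softmax(matvec(router, x))
    chosen = max(range(len(experts)), key=lambda i: probs[i])
    # Only the chosen expert's encoder runs, which is where the compute saving comes from.
    reconstruction = experts[chosen].forward(x)
    # Assumed Switch-style gating: scale by the router probability.
    return chosen, [probs[chosen] * value for value in reconstruction]
```

Because each input touches a single expert, the encoder cost per activation scales with one expert's width rather than with the total dictionary size.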



GaDezqTWwAAJXXW.png


3/4
@rohanpaul_ai
Switch SAEs are promising for scaling sparse autoencoders, particularly for huge training runs on large clusters of GPUs. Each expert can be placed on a separate GPU, leading to significant wall clock training speed ups. The main limitation is the reduction in performance at a fixed number of parameters compared to dense SAEs.

[2410.08201] Efficient Dictionary Learning with Switch Sparse Autoencoders



4/4
@rohanpaul_ai
🔍 How do Switch SAEs compare to other SAE architectures in terms of sparsity and reconstruction error?

FLOP-matched Switch SAEs Pareto-dominate TopK, Gated, and ReLU SAEs in terms of both mean squared error (MSE) and loss recovered. Width-matched Switch SAEs perform slightly worse than TopK SAEs but Pareto-dominate ReLU SAEs while performing fewer FLOPs.



GaDfGzTWoAA1YuK.png









1/5
@rohanpaul_ai
In-Context Reinforcement Learning (ICRL) unlocks new learning paradigms for LLMs, enabling adaptation through reward signals alone, without parameter updates.

The paper proposes an algorithm that overcomes naive ICRL's exploration deficiency by increasing test-time compute, as well as a compute-bound approximation of it.

**Original Problem** 🤔:

LLMs exhibit in-context supervised learning, but can they perform In-Context Reinforcement Learning (ICRL) without parameter updates?

-----

**Solution in this Paper** 🧠:

• Proposed Explorative ICRL algorithm to address exploration deficiency
• Introduced stochasticity in prompt construction by randomly sampling past episodes
• Filtered out negative reward examples to simplify prompt reasoning
• Developed Approximate ICRL to reduce computational costs while maintaining performance

-----

**Key Insights from this Paper** 💡:

• Naive ICRL fails due to lack of exploration and difficulty learning from negative rewards
• LLMs can effectively learn from rewards alone through ICRL
• Stochasticity in context generation and focusing on positive examples are crucial for ICRL success
• Approximate ICRL offers a compute-efficient alternative to Explorative ICRL

-----

**Results** 📊:

• Explorative ICRL significantly outperformed zero-shot and naive ICRL across all tasks
• Banking-77 task: Llama improved from 17.2% zero-shot to 66.0% accuracy with Explorative ICRL
• Approximate ICRL reduced processed tokens by two orders of magnitude compared to Explorative
• Llama showed more robustness to approximation than Phi, requiring less computational budget



GaBYwFIXoAAZtrt.png


2/5
@rohanpaul_ai
🧠 How does in-context reinforcement learning (ICRL) differ from standard in-context learning (ICL)?

ICRL uses triplets of input, model prediction, and reward in the context, instead of input-output pairs used in ICL. The model has to learn from these reward signals rather than gold labels.
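
To make the difference concrete, here is a minimal sketch of the two context formats. The field labels ("Input", "Label", "Prediction", "Reward") and the example strings are hypothetical; the paper's actual prompt wording is not given in this thread.

```python
def format_icl(examples):
    """Standard ICL: each demonstration is an (input, gold_label) pair."""
    return "\n\n".join(f"Input: {x}\nLabel: {y}" for x, y in examples)

def format_icrl(episodes):
    """ICRL: each episode is an (input, model_prediction, reward) triplet."""
    return "\n\n".join(
        f"Input: {x}\nPrediction: {p}\nReward: {r}" for x, p, r in episodes
    )

icl_context = format_icl([("I lost my card", "lost_or_stolen_card")])
icrl_context = format_icrl([("I lost my card", "card_arrival", 0)])
```

In the ICRL case the context never contains the gold label, only the model's own guess and whether it was rewarded.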



GaBZfrlWgAE__Xy.png


3/5
@rohanpaul_ai
Explorative ICRL was computationally expensive due to constructing a new context for each input. The authors proposed an Approximate ICRL method that maintains a fixed number of contexts and gradually expands them, reducing computational costs while maintaining performance.
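
A rough sketch of that idea follows, under stated assumptions: the class name, the `keep_prob` parameter, and the rule of appending each positive episode to each context with a fixed probability are illustrative guesses, not the paper's exact procedure.

```python
import random

class ApproximateContexts:
    """Keep a fixed pool of contexts that only ever grow by appending."""

    def __init__(self, num_contexts, keep_prob, rng=None):
        self.contexts = [[] for _ in range(num_contexts)]
        self.keep_prob = keep_prob
        self.rng = rng or random.Random()

    def pick(self):
        """Choose one existing context for the next prediction."""
        return self.rng.choice(self.contexts)

    def add(self, episode):
        """Append a positive (input, prediction, reward) episode to a random subset."""
        if episode[2] <= 0:
            return  # assumption: negative-reward episodes are dropped
        for ctx in self.contexts:
            if self.rng.random() < self.keep_prob:
                ctx.append(episode)

store = ApproximateContexts(num_contexts=3, keep_prob=1.0, rng=random.Random(0))
store.add(("I lost my card", "lost_or_stolen_card", 1))
store.add(("Where is my card?", "lost_or_stolen_card", 0))
```

Because contexts are only extended at the end, a serving stack could in principle reuse previously processed prefixes rather than re-reading a freshly sampled context for every input.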



4/5
@rohanpaul_ai
📚 [2410.05362] LLMs Are In-Context Reinforcement Learners



5/5
@rohanpaul_ai
🚫 The naive ICRL approach failed miserably, with models quickly degenerating to always predicting the same output. This was due to an inability to explore and difficulty learning from complex in-context signals like negative rewards.

🔍 So the authors introduce two key modifications to address the exploration problem in ICRL:

The authors introduced stochasticity in prompt construction by randomly sampling past episodes to include in the context. They also filtered out negative reward examples, focusing only on positive rewards in the context.
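
The two modifications can be sketched as a single context-building step. This is a sketch under assumptions: the function name and the `keep_prob` parameter are hypothetical, and a reward above zero is taken to mean "positive".

```python
import random

def build_explorative_context(history, keep_prob, rng):
    """Build a fresh context for one input.

    history: list of (input, prediction, reward) episodes seen so far.
    Negative-reward episodes are filtered out, and each positive episode
    is included at random, so every input sees a different context.
    """
    return [
        episode
        for episode in history
        if episode[2] > 0 and rng.random() < keep_prob
    ]

rng = random.Random(0)
history = [("a", "x", 1), ("b", "y", 0), ("c", "z", 1)]
all_positive = build_explorative_context(history, 1.0, rng)
sampled = build_explorative_context(history, 0.5, rng)
```

The random subset is what supplies exploration: different contexts push the model toward different predictions, rather than letting it lock onto one output.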



GaBZ6HPWQAAT4qF.png





1/11
@paulgauthier
Llama-3.1-Nemotron-70B-Instruct-HF scored 55% on aider's leaderboard, just behind plain llama-3.1-70b-instruct.

59% llama-3.1-70b-instruct
55% Llama-3.1-Nemotron-70B-Instruct-HF

Aider LLM Leaderboards



GaBzrXUb0AQEb1S.jpg


2/11
@ichrvk
Nemotron seems to game benchmarks really well. Real world usage just doesn't hold up.



3/11
@seveibar
Always gotta wait for the aider benchmark. I heard claims it was better than sonnet. Saved me a bunch of time evaluating it myself! 🫡



4/11
@one__zero__zero
The leaderboard has spoken.



5/11
@whoisnnamdi
Helpful...



6/11
@minosvasilias
literally the only benchmark that matters



7/11
@AlexTobiasDev
I can confirm that the model doesn't live up to the hype.

In real-world tests it did nothing. At least for me.



8/11
@NathanS64855891
Fine tune fails to live up to the hype. A tale as old as time.



9/11
@TheXeophon
Thank you so much for doing these :smile:



10/11
@EmreCoklar
This is the only benchmark that means anything anymore



11/11
@jerzydejm
very cool to see you quick with evals, thanks!











1/11
@lmarena_ai
Big News from Chatbot Arena!

@01AI_YI's latest model Yi-Lightning has been extensively tested in Arena, collecting over 13K community votes!

Yi-Lightning has climbed to #6 in the Overall rankings (#9 in Style Control), matching top models like Grok-2. It delivers robust performance in technical areas like Math, Hard Prompts, and Coding.

Huge congrats to @01AI_YI!

Meanwhile, GLM-4-Plus by Zhipu AI (@ChatGLM) has also entered the top 10, marking a strong surge for Chinese LLMs. They're quickly becoming highly competitive. Stay tuned for more!

More analysis below👇

[Quoted tweet]
Yi-Lightning is now in Chatbot Arena! The latest and most capable model from @01AI_Yi.

Come chat and vote at lmarena.ai. The leaderboard will be updated soon.


GZ8qDOBb0AY7Pc1.jpg


2/11
@lmarena_ai
Yi-Lightning Category Rankings:

- Overall: #6 (Style Control #9)
- Math: #3
- Coding: #4
- Hard Prompts: #4

GLM-4-Plus

- Overall: #9 (Style Control #15)
- Math: #8
- Coding: #4
- Hard Prompts: #6



GZ8sXScbkAAyzqy.jpg


3/11
@lmarena_ai
Confidence interval plot on model strength



GZ8ttxPa0AE2Twz.jpg


4/11
@lmarena_ai
When applying Style Control, both models' rankings drop a bit.



GZ8wCYbb0AQ3Lks.jpg


5/11
@lmarena_ai
Win rate heat map



GZ8wQ4fb0AE3n69.jpg


6/11
@lmarena_ai
Check out full result at http://lmarena.ai/leaderboard !



7/11
@nicdunz
stop showing the overall like it means anything. yall need to add weight to the categories. 4o latest is beating o1 and o1 mini overall bc its better in nothing but longer query and multi turn. that should not put it in first place.



8/11
@AIwithBenefits
Makes sense @01AI_Yi made another based model. Their Yi Large preview was arguably best non OAI/Anthropic model out there until Meta dropped Llama 3



9/11
@trlyng2
Massive strides for Chinese LLMs. Congrats to Yi-Lightning for reaching #6 on the leaderboard, and to GLM-4-Plus for breaking into the top 10



10/11
@MaziyarPanahi
Congratulations to @01AI_Yi 🙌🏼



11/11
@joserivera234
I think the benchmarks are not very well done. I use ChatGPT on a daily basis and the o1 models are so much better than 4o, so 4o being above o1 doesn't look right to me. Almost everything is better on o1-preview and o1-mini.






1/1
@youraimarketer
A new agentic framework, AFLOW, is taking optimization to the next level.

It automates agentic workflows using Monte Carlo Tree Search, driving a 5.7% performance boost over state-of-the-art methods on key benchmarks like HotpotQA, DROP, HumanEval, and more.

AFLOW efficiently models workflows as interconnected LLM-invoking nodes, enhancing accuracy, efficiency, and robustness across tasks.
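
As a rough illustration of "workflows as LLM-invoking nodes", here is a minimal sketch. Everything in it is hypothetical: the `Node` class, the stub LLM, and the scoring helper are not AFLOW's API, and the Monte Carlo Tree Search that proposes and refines workflows is not shown.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    prompt_template: str  # must contain "{input}"

def run_workflow(nodes, llm, task):
    """Pass the task through each node in order; each node is one LLM call."""
    text = task
    for node in nodes:
        text = llm(node.prompt_template.format(input=text))
    return text

def score_workflow(nodes, llm, dataset, evaluate):
    """Average an eval function over (question, answer) pairs."""
    return sum(evaluate(run_workflow(nodes, llm, q), a) for q, a in dataset) / len(dataset)

def stub_llm(prompt):
    """Stand-in for a real model call: it just wraps the prompt in brackets."""
    return f"[{prompt}]"

workflow = [Node("draft", "Draft: {input}"), Node("review", "Review: {input}")]
output = run_workflow(workflow, stub_llm, "2+2")
score = score_workflow(
    workflow,
    stub_llm,
    [("2+2", "2+2"), ("x", "y")],
    lambda out, ans: 1.0 if ans in out else 0.0,
)
```

A search procedure would then compare candidate workflows by a score of this kind, which is why only an eval function is needed to get started.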

What's even more impressive?

This framework enables smaller models to outperform larger ones like GPT-4o on some tasks, at just 4.55% of its inference cost!

By automating the design of efficient workflows, it reduces both the computational load and financial investment needed for optimal LLM performance.

The code is coming soon...

[Quoted tweet]
🚀 We are thrilled to introduce AFlow: Automating Agentic Workflow Generation

AFLOW autonomously generates & optimizes workflows using Monte Carlo Tree Search:

• Beats human-designed workflows in Coding, Math & QA
• Achieves GPT-4o-level on HumanEval at just 4.55% of its cost
• 96.2% HumanEval accuracy using GPT-4o with AFLOW
• Generate custom workflows in 1.5h with just an eval function

Explore more:
📄 Paper: arxiv.org/abs/2410.10762
💻 Code will be available at github.com/geekan/MetaGPT/


GZ6YqOpWEAAZWuU.jpg

GZ5zVorXUAAUz1r.jpg

GZ5zVonXYAAiSvh.jpg

GZ5zVooWoAA4-0m.jpg

GZ5zVonXAAA2Ncu.jpg









1/3
@rohanpaul_ai
Standard Vision Transformer FAILS on Abstraction and Reasoning Corpus (ARC) tasks.

So this paper proposes ViTARC, a Vision Transformer (ViT) architecture adapted for visual reasoning on ARC tasks.

📊 Results:

• ~100% test solve rate on >50% of the 400 public ARC tasks
• Achieved via supervised learning from input-output grids

[Quoted tweet]
We trained a Vision Transformer to solve ONE single task from @fchollet and @mikeknoop’s @arcprize. Unexpectedly, it failed to produce the test output, even when using 1 MILLION examples! Why is this the case? 🤔


GaBbpZ4XwAAmYwz.jpg

GZ8UkeMW0AAIedx.jpg


2/3
@rohanpaul_ai
💡 Insights:

- Highlights ViT's representational deficiency for structured mappings

- Stresses importance of inductive biases for abstract visual reasoning

- Demonstrates need for task-specific architectural modifications



3/3
@HCSolakoglu
"~100% test solve rate on >50% of 400 public ARC tasks" holy...















1/11
@WenhaoLi29
We trained a Vision Transformer to solve ONE single task from @fchollet and @mikeknoop’s @arcprize. Unexpectedly, it failed to produce the test output, even when using 1 MILLION examples! Why is this the case? 🤔



GZ8UkeMW0AAIedx.jpg


2/11
@WenhaoLi29
We investigated and found that there exist fundamental limitations in the vanilla Vision Transformer preventing it from performing visual abstract reasoning. We propose enhancements to address these shortcomings in our new paper “Tackling the Abstraction and Reasoning Corpus with Vision Transformers: the Importance of 2D Representation, Positions, and Objects” ([2410.06405])



GZ8Up4YWsAAx7sP.jpg


3/11
@WenhaoLi29
Specifically, we found that: 1) ViT has limited spatial awareness due to the representation of the images using flattened image patches. We address these issues by introducing special 2D visual Tokens that enable the ViT to become spatially aware. 2) ViT has limited access to positional and object information. We tackle this using object-based positional encodings that allow the ViT to attend to the correct pixels within the image grid.
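
To illustrate why a 2D-aware position signal differs from a flattened one, here is a small sketch of a generic 2D sinusoidal positional encoding. It is not the paper's exact scheme (the 2D visual tokens and object-based encodings are not reproduced here); the function names and the half-row, half-column split are assumptions.

```python
import math

def sinusoid(pos, dim):
    """Standard 1D sinusoidal encoding of an integer position."""
    return [
        math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]

def grid_positional_encoding(height, width, dim):
    """Encode each cell as [row encoding | column encoding].

    Cells in the same row share the first half of the vector and cells in
    the same column share the second half, so row and column membership are
    explicit instead of being buried in a flattened index r * width + c.
    """
    half = dim // 2
    return [
        [sinusoid(r, half) + sinusoid(c, half) for c in range(width)]
        for r in range(height)
    ]

pe = grid_positional_encoding(height=3, width=3, dim=8)
```

With a flattened sequence, cells (0, 2) and (1, 0) are adjacent positions even though they sit at opposite ends of the grid; the encoding above keeps them apart.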



4/11
@WenhaoLi29
Implementing our enhancements, our framework “ViTARC” saw a significant improvement from the vanilla ViT! Task-specific models were able to achieve 100% accuracy on over half of the 400 ARC training tasks.



GZ8U1vHXwAAjaFW.jpg


5/11
@WenhaoLi29
Learn more about our work here: [2410.06405] Tackling the Abstraction and Reasoning Corpus with Vision Transformers: the Importance of 2D Representation, Positions, and Objects
w/ @yudongxuwil @ScottSanner @lyeskhalil



6/11
@tinycrops
what happens if you use illumination to encode spatial relationship information in the image? @_jason_today Building Real-Time Global Illumination: Radiance Cascades



GZ9LWtmXsAAc0g4.png


7/11
@WenhaoLi29
Looks great! And yes, for OPE in our paper, you can use any external source of objectness information.



8/11
@JonathanRoseD
I've been playing around with just this idea to better read ARC patterns, but you're miles ahead here. Will/when this model be open sourced for us to play with? 🤠



9/11
@WenhaoLi29
Yes, we're working on it! The enhancements we mentioned are not too hard to implement on a raw CodeT5 or T5, so you could give it a try directly in the meantime.



10/11
@rkarmani
Did you submit it for evaluation over the private ARC tasks?



11/11
@WenhaoLi29
No, this isn’t an ARC solver yet (still working on generalization), but a solver still needs to read grids, so the enhancements are definitely relevant.






1/9
@Xianbao_QIAN
Open Sora Plan has released the 1.3 version of their video generation model.

Open-Sora-Plan/docs/Report-v1.3.0.md at main · PKU-YuanGroup/Open-Sora-Plan

Love this #BlackMythWukong story illustrated by AI!



2/9
@DataPlusEngine
it's honestly shocking the level of quality and professionalism they display considering this is an open source project. amazing work!



3/9
@ai_for_success
Damn this is 🔥



4/9
@MrForExample
Awesome🔥🔥🔥
The spirit of resisting authority is somewhat outstanding
Which is why, I'm not sure if this video will be allowed inside China's internet🙈🙉🙊



5/9
@1littlecoder
This is insane!



6/9
@Prashant_1722
This is more than real, Hollywood level.



7/9
@jg_community
Incredible what can be done with AI-generated art. I've played this game 'Black Myth Wukong' myself and I'm loving how these illustrations bring the story to life.



8/9
@BeLikeS4M
This is so cool🔥



9/9
@DavideSpagocci
Is there any website already set up to use it easily?



